java unicode diacritics transliteration

Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars


I am looking for an algorithm that can map characters with diacritics (tilde, circumflex, caret, umlaut, caron) to their "simple" base character.

For example:

ń  ǹ  ň  ñ  ṅ  ņ  ṇ  ṋ  ṉ  ̈  ɲ  ƞ ᶇ ɳ ȵ  --> n
á --> a
ä --> a
ấ --> a
ṏ --> o

Etc.

  1. I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.

  2. Purpose: to make it easy to search for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also store Bjorn_Borg so that I can find it when someone enters Bjorn rather than Björn.


Solution

  • I have done this recently in Java:

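    import java.text.Normalizer;
    import java.util.regex.Pattern;

    // Combining marks, modifier letters (Lm), modifier symbols (Sk), and Hebrew diacritics
    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");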
    
    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }
    

    This will do as you specified:

    stripDiacritics("Björn")  = Bjorn
    

    but it will fail on, for example, Białystok, because ł is not a base letter plus a diacritic: it has no canonical decomposition, so NFD normalization leaves it intact.
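
    The difference is easy to see by printing the code points after NFD normalization. The following is a minimal sketch; the class name NfdDemo and the helper codePoints are made up for this illustration and are not part of the code above.

    ```java
    import java.text.Normalizer;

    // Hypothetical demo class: lists the code points of a string after NFD normalization.
    class NfdDemo {
        static String codePoints(String s) {
            String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
            StringBuilder out = new StringBuilder();
            nfd.codePoints().forEach(cp -> out.append(String.format("U+%04X ", cp)));
            return out.toString().trim();
        }

        public static void main(String[] args) {
            // U+00F6 (o with diaeresis) decomposes into o + U+0308 (combining diaeresis)
            System.out.println(codePoints("\u00F6")); // prints "U+006F U+0308"
            // U+0142 (l with stroke) has no canonical decomposition and comes back unchanged
            System.out.println(codePoints("\u0142")); // prints "U+0142"
        }
    }
    ```

    Since the stroke never becomes a separate combining mark, no regular expression over the normalized string can remove it; such letters need an explicit mapping.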

    If you want a full-blown string simplifier, you will need a second cleanup round for special characters that are not diacritics. In the map below, I have included the most common special characters that appear in our customer names. It is not a complete list, but it should give you an idea of how to extend it. ImmutableMap is just a simple class from google-collections (now part of Guava).

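    import com.google.common.collect.ImmutableMap;
    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public class StringSimplifier {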
        public static final char DEFAULT_REPLACE_CHAR = '-';
        public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);
        private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()
    
            // Remove characters that carry no semantics
            .put(".", "")
            .put("\"", "")
            .put("'", "")
    
            // Keep relevant characters as separators
            .put(" ", DEFAULT_REPLACE)
            .put("]", DEFAULT_REPLACE)
            .put("[", DEFAULT_REPLACE)
            .put(")", DEFAULT_REPLACE)
            .put("(", DEFAULT_REPLACE)
            .put("=", DEFAULT_REPLACE)
            .put("!", DEFAULT_REPLACE)
            .put("/", DEFAULT_REPLACE)
            .put("\\", DEFAULT_REPLACE)
            .put("&", DEFAULT_REPLACE)
            .put(",", DEFAULT_REPLACE)
            .put("?", DEFAULT_REPLACE)
            .put("°", DEFAULT_REPLACE) // Degree sign; unclear whether it counts as a diacritic
            .put("|", DEFAULT_REPLACE)
            .put("<", DEFAULT_REPLACE)
            .put(">", DEFAULT_REPLACE)
            .put(";", DEFAULT_REPLACE)
            .put(":", DEFAULT_REPLACE)
            .put("_", DEFAULT_REPLACE)
            .put("#", DEFAULT_REPLACE)
            .put("~", DEFAULT_REPLACE)
            .put("+", DEFAULT_REPLACE)
            .put("*", DEFAULT_REPLACE)
    
            // Replace non-diacritics with their closest plain equivalents
            .put("\u0141", "l") // Ł (uppercase), as in BIAŁYSTOK
            .put("\u0142", "l") // ł (lowercase), as in Białystok
            .put("ß", "ss")
            .put("æ", "ae")
            .put("ø", "o")
            .put("©", "c")
            .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
            .put("\u00F0", "d")
            .put("\u0110", "d")
            .put("\u0111", "d")
            .put("\u0189", "d")
            .put("\u0256", "d")
            .put("\u00DE", "th") // thorn Þ
            .put("\u00FE", "th") // thorn þ
            .build();
    
    
        public static String simplifiedString(String orig) {
            String str = orig;
            if (str == null) {
                return null;
            }
            str = stripDiacritics(str);
            str = stripNonDiacritics(str);
            if (str.length() == 0) {
                // Ugly special case: Oracle cannot store empty strings (it treats
                // them as NULL), so store the original string as the simplified one.
                // It would return an empty string if Oracle could store it.
                return orig;
            }
            // Locale.ROOT keeps the result independent of the JVM default locale
            return str.toLowerCase(java.util.Locale.ROOT);
        }
    
        private static String stripNonDiacritics(String orig) {
            StringBuilder ret = new StringBuilder();
            String lastchar = null;
            for (int i = 0; i < orig.length(); i++) {
                String source = orig.substring(i, i + 1);
                String replace = NONDIACRITICS.get(source);
                String toReplace = replace == null ? String.valueOf(source) : replace;
                if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                    toReplace = "";
                } else {
                    lastchar = toReplace;
                }
                ret.append(toReplace);
            }
            if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
                ret.deleteCharAt(ret.length() - 1);
            }
            return ret.toString();
        }
    
        /*
         * Special regular expression character ranges relevant for simplification:
         * - InCombiningDiacriticalMarks: diacritic marks used in many languages
         * - IsLm: Letter, Modifier (see http://www.fileformat.info/info/unicode/category/Lm/list.htm)
         * - IsSk: Symbol, Modifier (see http://www.fileformat.info/info/unicode/category/Sk/list.htm)
         * - U+0591 to U+05C7: Range for Hebrew diacritics (niqqud)
         *   (see official Unicode chart: https://www.unicode.org/charts/PDF/U0590.pdf)
         */
        public static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile(
            "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+"
        );
    
    
        private static String stripDiacritics(String str) {
            str = Normalizer.normalize(str, Normalizer.Form.NFD);
            str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
            return str;
        }
    }
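
    For the search use case from the question, the simplified string can serve as a lookup key next to the original name. The following is a rough sketch under several assumptions, not the StringSimplifier above: the class FoldedNameIndex and its methods are hypothetical, it needs Java 9+ (Map.of, List.of) instead of Guava, it strips every Unicode mark with \p{M} rather than using the narrower pattern above, and its extra-letter map is deliberately tiny.

    ```java
    import java.text.Normalizer;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Locale;
    import java.util.Map;
    import java.util.regex.Pattern;

    // Hypothetical sketch: store names under a folded key so "Bjorn_Borg" finds the accented name.
    class FoldedNameIndex {
        // Assumption: \p{M} (any Unicode mark) is broader than the answer's pattern
        private static final Pattern MARKS = Pattern.compile("\\p{M}+");

        // Letters NFD cannot decompose; lowercase only, because fold() lowercases first
        private static final Map<Character, String> EXTRA = Map.of(
            '\u0142', "l",   // l with stroke
            '\u00DF', "ss",  // sharp s
            '\u00E6', "ae",  // ae ligature
            '\u00F8', "o");  // o with stroke

        private final Map<String, List<String>> byKey = new HashMap<>();

        static String fold(String s) {
            String lower = s.toLowerCase(Locale.ROOT);
            String nfd = Normalizer.normalize(lower, Normalizer.Form.NFD);
            String stripped = MARKS.matcher(nfd).replaceAll("");
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < stripped.length(); i++) {
                char c = stripped.charAt(i);
                String mapped = EXTRA.get(c);
                if (mapped != null) {
                    out.append(mapped);
                } else {
                    out.append(c);
                }
            }
            return out.toString();
        }

        void add(String name) {
            byKey.computeIfAbsent(fold(name), k -> new ArrayList<>()).add(name);
        }

        List<String> find(String query) {
            return byKey.getOrDefault(fold(query), List.of());
        }

        public static void main(String[] args) {
            FoldedNameIndex index = new FoldedNameIndex();
            index.add("Bj\u00F6rn_Borg");
            // Both the accented and the plain spelling fold to the same key
            System.out.println(index.find("Bjorn_Borg").size()); // prints 1
        }
    }
    ```

    The lookup here is an exact match on the folded key, so a query for just "Bjorn" finds nothing; in a database the same idea would typically be a second indexed column holding the folded value, queried with a prefix or LIKE match.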