java apache-stringutils

Replace Unicode Characters in a String


I need to replace diacritic characters (e.g. ä, ó, etc.) with their 'base' character. For most of the characters, this solution works:

StringUtils.stripAccents(tmpStr);

but this misses four characters: æ, œ, ø, and ß.

I looked at the answers to "Is there a way to get rid of accents and convert a whole string to regular letters?". I expected the first solution there to work, but it does not.

How can I replace these characters with their 'base' character (e.g. replace æ with a)?


Solution

  • The source code of StringUtils (https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html) shows:

    public static String stripAccents(final String input) {
        if (input == null) {
            return null;
        }
        final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
        convertRemainingAccentCharacters(decomposed);
        // Note that this doesn't correctly remove ligatures...
        return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll(EMPTY);
    }
    

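    Note the comment in that method: // Note that this doesn't correctly remove ligatures...

    Characters such as æ, œ, ø and ß have no Unicode decomposition into a base letter plus a combining mark, so normalization leaves them untouched. You will probably need to replace those characters manually, for example: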

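        // Decompose accented letters into base letter + combining mark(s)
        String string = Normalizer.normalize("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ", Normalizer.Form.NFKD);
        // Remove all combining marks
        string = string.replaceAll("\\p{M}", "");

        // Map the characters that normalization cannot decompose
        string = string.replace("ß", "s");
        string = string.replace("ø", "o");
        string = string.replace("œ", "o");
        string = string.replace("æ", "a");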
    

    For a fuller mapping table, see "Diacritical Character to ASCII Character Mapping": https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html
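
    The manual replacement above can be wrapped in a small reusable helper. This is only a sketch: the class name AccentFolder and the method name toBase are made up for illustration and are not part of Commons Lang or the JDK. It assumes you want the single-letter targets asked for in the question (ß becomes s, æ becomes a, and so on) and adds the uppercase variants, which the snippet above does not handle. The characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class AccentFolder {

    // Matches runs of Unicode combining marks left behind by decomposition.
    private static final Pattern MARKS = Pattern.compile("\\p{M}+");

    static String toBase(String input) {
        if (input == null) {
            return null;
        }
        // NFKD splits an accented letter into its base letter plus
        // combining mark(s), which the pattern then removes.
        String s = Normalizer.normalize(input, Normalizer.Form.NFKD);
        s = MARKS.matcher(s).replaceAll("");

        // These letters have no decomposition, so map them by hand.
        // Assumption: single-letter targets, as requested in the question.
        return s.replace('\u00DF', 's')   // sharp s
                .replace('\u00F8', 'o')   // o with stroke
                .replace('\u00D8', 'O')   // O with stroke
                .replace('\u0153', 'o')   // oe ligature
                .replace('\u0152', 'O')   // OE ligature
                .replace('\u00E6', 'a')   // ae ligature
                .replace('\u00C6', 'A');  // AE ligature
    }
}
```

    Calling AccentFolder.toBase("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ") should give "This is a funky s a o o String". If you would rather expand the ligatures (æ to "ae", ß to "ss"), switch to the String overload of replace for those entries.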