I need to replace diacritic characters (e.g. ä, ó, etc.) with their 'base' character. For most of the characters, this solution works:
StringUtils.stripAccents(tmpStr);
but this misses four characters: æ, œ, ø, and ß.
I took a look at the answers to "Is there a way to get rid of accents and convert a whole string to regular letters?". I figured the first solution there would work, but it does not.
How can I replace these characters with their 'base' character (e.g. replace æ with a)?
The source code says (https://commons.apache.org/proper/commons-lang/apidocs/src-html/org/apache/commons/lang3/StringUtils.html),
public static String stripAccents(final String input) {
    if (input == null) {
        return null;
    }
    final StringBuilder decomposed = new StringBuilder(Normalizer.normalize(input, Normalizer.Form.NFD));
    convertRemainingAccentCharacters(decomposed);
    // Note that this doesn't correctly remove ligatures...
    return STRIP_ACCENTS_PATTERN.matcher(decomposed).replaceAll(EMPTY);
}
It has a comment that says,
// Note that this doesn't correctly remove ligatures...
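A likely explanation, offered here as a reading of that comment rather than something the source states: æ, œ, ø, and ß have no Unicode decomposition, so normalizing leaves them unchanged and there is no combining mark to strip. The small check below is only a sketch; the class and method names (DecompositionCheck, decomposes) are made up for illustration.

```java
import java.text.Normalizer;

public class DecompositionCheck {

    // Hypothetical helper: true if NFD normalization changes the string,
    // i.e. the text has a canonical decomposition into base + marks.
    public static boolean decomposes(String s) {
        return !Normalizer.normalize(s, Normalizer.Form.NFD).equals(s);
    }

    public static void main(String[] args) {
        // Unicode escapes for: ä, ó, æ, œ, ø, ß
        String[] samples = {"\u00E4", "\u00F3", "\u00E6", "\u0153", "\u00F8", "\u00DF"};
        for (String s : samples) {
            System.out.println(s + " decomposes: " + decomposes(s));
        }
    }
}
```

The first two samples should report true and the last four false, which matches the four characters the question says are missed.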
So maybe you need to replace those characters manually. Something like,
// Decompose accented characters, then drop the combining marks.
String string = Normalizer.normalize("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ", Normalizer.Form.NFKD);
string = string.replaceAll("\\p{M}", "");
// These characters have no decomposition, so map them by hand.
string = string.replace("ß", "s");
string = string.replace("ø", "o");
string = string.replace("œ", "o");
string = string.replace("æ", "a");
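If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the class name AccentFolder, the method fold, and the mapping table are made up for illustration, the uppercase entries (Æ, Œ, Ø) are an addition that the snippet above does not cover, and Map.of needs Java 9 or later. The single-letter targets follow the 'base character' choice in the question; adjust them if you prefer "ae", "oe", or "ss".

```java
import java.text.Normalizer;
import java.util.Map;

public class AccentFolder {

    // Hypothetical mapping for characters that have no Unicode decomposition.
    // Written as escapes so the file compiles regardless of source encoding.
    private static final Map<Character, String> NO_DECOMPOSITION = Map.of(
            '\u00E6', "a",   // æ
            '\u00C6', "A",   // Æ
            '\u0153', "o",   // œ
            '\u0152', "O",   // Œ
            '\u00F8', "o",   // ø
            '\u00D8', "O",   // Ø
            '\u00DF', "s");  // ß

    public static String fold(String input) {
        if (input == null) {
            return null;
        }
        // Step 1: decompose, then remove every combining mark.
        String stripped = Normalizer.normalize(input, Normalizer.Form.NFKD)
                .replaceAll("\\p{M}", "");
        // Step 2: map the leftovers that normalization cannot split.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String replacement = NO_DECOMPOSITION.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fold("Tĥïŝ ĩš â fůňķŷ ß æ œ ø Šťŕĭńġ"));
    }
}
```

Keeping the exceptions in one table means adding another character later is a one-line change rather than another chained replace call.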
For a fuller mapping table, see Diacritical Character to ASCII Character Mapping: https://docs.oracle.com/cd/E29584_01/webhelp/mdex_basicDev/src/rbdv_chars_mapping.html