java accent-insensitive string-utils

Convert accented characters to English using Java


I have a requirement to support searching with accented characters, for users from places such as Iceland and Japan. The code I wrote works for a few accented characters but not all. Examples:

À - returns a. Correct.
Â - returns a. Correct.
Ð - returns Ð. This is wrong. It should return e.
Õ - returns Õ. This is wrong. It should return o.

Below is my code:

String accentConvertStr = StringUtils.stripAccents(myKey);

I also tried this:

byte[] b = key.getBytes("Cp1252");
System.out.println("" + new String(b, StandardCharsets.UTF_8));

Please advise.


Solution

  • I would say it works as expected. Under the hood, StringUtils.stripAccents essentially does the following:

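    // requires: import java.text.Normalizer;
    String[] chars = {"À", "Â", "Ð", "Õ"};

    for (String c : chars) {
      // NFD splits a letter into its base character plus combining marks
      String normalized = Normalizer.normalize(c, Normalizer.Form.NFD);
      // remove the combining marks, leaving only the base character
      System.out.println(normalized.replaceAll("\\p{InCombiningDiacriticalMarks}+", ""));
    }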
    

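    This will output A, A, Ð, O (one per line). Ð has no canonical decomposition, so there is no combining mark to strip and it comes back unchanged.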

    The answer at https://stackoverflow.com/a/5697575/9671280 explains why:

    Be aware that that will not remove what you might think of as “accent” marks from all characters! There are many it will not do this for. For example, you cannot convert Đ to D or ø to o that way. For that, you need to reduce code points to those that match the same primary collation strength in the Unicode Collation Table.

    You could handle such characters separately if you still want to use StringUtils.stripAccents.

    Alternatively, try https://github.com/xuender/unidecode, which seems to work for your case:

     String normalized = Unidecode.decode(input);
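
For the "handle it separately" route mentioned above, here is a minimal sketch using only the JDK (Java 9+ for Map.ofEntries). The class name AccentFolding, the method fold and the FALLBACK table are hypothetical names made up for this illustration, and the table is deliberately incomplete. Mapping Ð to D is an assumption based on the usual transliteration; the question asks for e, so adjust the entries to whatever your search needs.

```java
import java.text.Normalizer;
import java.util.Map;

class AccentFolding {

    // Hypothetical fallback table (not part of any library): letters with no
    // canonical decomposition, which NFD plus mark removal leaves unchanged.
    // Written as Unicode escapes so the source compiles under any encoding.
    private static final Map<Character, String> FALLBACK = Map.ofEntries(
            Map.entry('\u00D0', "D"),  // Latin capital letter Eth
            Map.entry('\u00F0', "d"),  // Latin small letter eth
            Map.entry('\u0110', "D"),  // Latin capital letter D with stroke
            Map.entry('\u0111', "d"),  // Latin small letter d with stroke
            Map.entry('\u00D8', "O"),  // Latin capital letter O with stroke
            Map.entry('\u00F8', "o"),  // Latin small letter o with stroke
            Map.entry('\u00DE', "Th"), // Latin capital letter Thorn
            Map.entry('\u00FE', "th"), // Latin small letter thorn
            Map.entry('\u00C6', "AE"), // Latin capital letter AE
            Map.entry('\u00E6', "ae")  // Latin small letter ae
    );

    static String fold(String input) {
        // Step 1: same idea as in the answer above: decompose, then drop
        // the combining diacritical marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

        // Step 2: replace the letters that survived step 1 using the table.
        StringBuilder out = new StringBuilder(stripped.length());
        for (int i = 0; i < stripped.length(); i++) {
            char ch = stripped.charAt(i);
            String replacement = FALLBACK.get(ch);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(ch);
            }
        }

        // Step 3: recompose whatever was decomposed but not stripped
        // (for example Japanese kana with voicing marks).
        return Normalizer.normalize(out.toString(), Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // The four characters from the question: A-grave, A-circumflex, Eth, O-tilde
        for (String s : new String[] {"\u00C0", "\u00C2", "\u00D0", "\u00D5"}) {
            System.out.println(fold(s));
        }
    }
}
```

With this table the four characters from the question come out as A, A, D, O. Call toLowerCase on both the search key and the indexed text if the comparison should also ignore case.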