java string

How to get an alphanumeric String from any string in Java?


Possible Duplicate:
ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n or Remove diacritical marks from unicode chars
How to replace special characters in a string?

I would like to turn a String such as "I>Télé" into something like "itele". I want the String to be lower case (done), without whitespace (done), and with no accents or special characters (like >, <, /, %, ~, é, @, ï etc).

It is okay to delete occurrences of special characters, but I want to keep letters while removing their accents (as in my example). Here is what I did, but I don't think a good solution is to replace every é, è, ê, ë with "e", then do the same for "i", "a" etc., and then remove every special character...

String name = "I>télé"; // example
String result = name.toLowerCase().replace(" ", "").replace("é","e").........;

The purpose is to produce a valid filename for resources in an Android app, so if you have any other ideas, I'll take them!


Solution

  • You can use the java.text.Normalizer class to decompose your text into base Latin characters followed by diacritic marks (accents), where possible. For example, the single-character string "é" would become the two-character string ['e', {COMBINING ACUTE ACCENT}].

    After you've done this, your String would be a combination of unaccented characters, accent modifiers, and the other special characters you've mentioned. At this point you could filter the characters in your string using only a whitelist to keep what you want (which could be as simple as [A-Za-z0-9] for a regex, depending on what you're after).

    An approach might look like:

    import java.text.Normalizer;
    import java.text.Normalizer.Form;

    String name = "I>télé"; // example
    String normalized = Normalizer.normalize(name, Form.NFD);
    String result = normalized.replaceAll("[^A-Za-z0-9]", "").toLowerCase(); // "itele"
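
For a version that compiles on its own, the steps above can be wrapped in a small helper. This is only a sketch: the class name AlphanumericDemo and the method name toAlphanumeric are made up for illustration, and lowercasing with Locale.ROOT is an assumption about what you want for filenames. The inputs use Unicode escapes so the file compiles regardless of source encoding; "I>T\u00e9l\u00e9" is "I>Télé".

```java
import java.text.Normalizer;
import java.util.Locale;

final class AlphanumericDemo {

    // Hypothetical helper: decompose, keep ASCII letters and digits, lowercase.
    static String toAlphanumeric(String input) {
        // NFD splits an accented letter into its base letter plus a combining mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Whitelist filter: combining marks, spaces and symbols are all removed.
        String stripped = decomposed.replaceAll("[^A-Za-z0-9]", "");
        // Locale.ROOT keeps lowercasing independent of the device locale.
        return stripped.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(toAlphanumeric("I>T\u00e9l\u00e9"));      // prints "itele"
        System.out.println(toAlphanumeric("\u00c7a va? 100%"));      // prints "cava100"
    }
}
```

Note that letters with no canonical decomposition, such as "ø" or "ß", are not split by NFD and are simply dropped by the whitelist, so they would need an explicit replacement if you want to keep them.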