
How to normalize Unicode digits in Java


Is there any Java API to normalize Unicode digits into ASCII digits?

There are normalization APIs in the JDK and ICU4J, but they do not seem able to handle this kind of conversion (probably because the Unicode standard does not call it normalization).

What I need is to convert all forms of Unicode digits (listed in this post) into [0-9]. One possible but messy solution is ten replace-all calls, one for each digit from 0 to 9.


Solution

  • UPDATE

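    This is possible using the ICU4J Transliteration API (com.ibm.icu.text.Transliterator). The following transliterator converts the text to Latin script, applies NFD decomposition, and then removes every character other than a-z, A-Z, 0-9 and the hyphen-minus, so spaces and punctuation are dropped as well.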

    Transliterator trans = Transliterator.getInstance("Any-Latin; NFD; [^a-zA-Z0-9-] Remove");
    System.out.println(trans.transform("۱۲۳456"));
    

    Will print:

    123456
    

    Another messy solution, using one regular expression per digit (requires java.util.regex.Pattern):

    static final Pattern DIGIT_0 = Pattern.compile("[٠۰߀०০੦૦୦௦౦೦൦๐໐0]");
    static final Pattern DIGIT_1 = Pattern.compile("[١۱߁१১੧૧୧௧౧೧൧๑໑1]");
    static final Pattern DIGIT_2 = Pattern.compile("[٢۲߂२২੨૨୨௨౨೨൨๒໒2]");
    static final Pattern DIGIT_3 = Pattern.compile("[٣۳߃३৩੩૩୩௩౩೩൩๓໓3]");
    static final Pattern DIGIT_4 = Pattern.compile("[٤۴߄४৪੪૪୪௪౪೪൪๔໔4]");
    static final Pattern DIGIT_5 = Pattern.compile("[٥۵߅५৫੫૫୫௫౫೫൫๕໕5]");
    static final Pattern DIGIT_6 = Pattern.compile("[٦۶߆६৬੬૬୬௬౬೬൬๖໖6]");
    static final Pattern DIGIT_7 = Pattern.compile("[٧۷߇७৭੭૭୭௭౭೭൭๗໗7]");
    static final Pattern DIGIT_8 = Pattern.compile("[٨۸߈८৮੮૮୮௮౮೮൮๘໘8]");
    static final Pattern DIGIT_9 = Pattern.compile("[٩۹߉९৯੯૯୯௯౯೯൯๙໙9]");
    
    public static final Pattern[] DIGIT_PATTERN_LIST = { DIGIT_0, DIGIT_1, DIGIT_2, DIGIT_3, DIGIT_4, DIGIT_5, DIGIT_6, DIGIT_7, DIGIT_8,
            DIGIT_9 };
    
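    /**
     * Converts the Unicode digits covered by DIGIT_PATTERN_LIST into their ASCII equivalents.
     * For example, given 23۹٤۴ it returns 23944.
     *
     * @param str the input string, possibly containing non-ASCII digits
     * @return the input with each matched digit replaced by its ASCII digit 0-9
     */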
    public static String normalizeUnicodeDigits(String str) {
        for (int i = 0; i < DIGIT_PATTERN_LIST.length; i++) {
            Pattern dp = DIGIT_PATTERN_LIST[i];
            str = dp.matcher(str).replaceAll(String.valueOf(i));
        }
        return str;
    }
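
A third option, not part of the answer above, is to rely on the JDK's own character database instead of listing the digits by hand. The sketch below assumes that "Unicode digit" means a character in general category Nd (decimal digit), which is what Character.isDigit tests for. The class name DigitNormalizer and the method name normalizeDigits are made up for illustration.

```java
public class DigitNormalizer {

    /**
     * Replaces every Unicode decimal digit (general category Nd) with the
     * matching ASCII digit and leaves all other characters untouched.
     */
    public static String normalizeDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // Iterate by code point so digits outside the BMP are handled too.
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (Character.isDigit(cp)) {
                // Character.digit yields the numeric value 0..9 for radix 10.
                out.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "23" followed by Persian 9, Arabic-Indic 4 and Persian 4.
        System.out.println(normalizeDigits("23\u06F9\u0664\u06F4")); // prints 23944
    }
}
```

Unlike the regex list, this makes a single pass over the string and picks up any script whose digits the running JDK knows about. Characters that merely look numeric but are not in category Nd, such as superscript or circled digits, are left unchanged, so whether this fits depends on which digits you need to cover.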