java ascii

Convert extended ASCII (>127) to standard ASCII (<128) in Java


We receive UTF-8 data from a third-party system. Our system handles it fine, since it is also UTF-8 compliant. The problem is that older downstream systems cannot always handle characters with a decimal value above 127, and they either break or display the data incorrectly.

Since we do not have control over the downstream systems, the only way to fix this issue is to convert the "extended" ASCII characters to their "base" (ASCII < 128) equivalents, e.g. ê and ë must become e, ò and ö must become o, and so on.

Is there a way to achieve this in Java without having to hard-code the mappings?


Solution

  • You can normalize the string to Unicode's decomposed form (NFD), which separates each base letter from its diacritical marks (the string will look the same), and then strip the diacritical marks. Together these steps effectively turn "é" into "e" (but also "ä" into "a") [1].

    Afterwards, escape any characters that are still non-ASCII to their hex codes using UnicodeEscaper from Apache Commons (it ships in both commons-lang3 and commons-text). The result won't be human-readable, but it will protect your legacy systems from special characters.

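    import java.text.Normalizer;
    import org.apache.commons.text.translate.UnicodeEscaper;
    import org.junit.jupiter.api.Test; // JUnit 5 assumed

    @Test
    void test() {
        String input = "äéöíæ";
        String normalized = asciify(input);
        System.out.println("Original: " + input);
        System.out.println("Normalized: " + normalized);
    }
    
    public static String asciify(String input) {
        // NFD splits each accented letter into a base letter plus combining marks;
        // the regex then removes those marks.
        String s = Normalizer.normalize(input, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Escape whatever is still above code point 127 (e.g. "æ", which has no decomposition).
        UnicodeEscaper escaper = UnicodeEscaper.above(127);
        return escaper.translate(s);
    }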
    

    prints:

    Original: äéöíæ
    Normalized: aeoi\u00E6
    

    [1] If you want to support conversions like "ä" -> "ae", hard-code them and apply them before this step. You indicated in your comments that you do not need German, though, and a quick Google search suggests that German is something of an outlier with these two-character conversions.
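
If you do end up needing the German-style conversions mentioned in the footnote, the pre-replacement step could look roughly like the sketch below. The class name GermanAsciify, the method asciifyGerman, and the mapping table are hypothetical and only cover the common German cases; extend the table to suit your data. The source is kept pure ASCII (Unicode escapes in the literals) so it compiles regardless of the file encoding.

```java
import java.text.Normalizer;

public class GermanAsciify {
    // Hypothetical hard-coded table: umlauts and sharp s, each mapped to two ASCII letters.
    private static final String[][] DIGRAPHS = {
        {"\u00E4", "ae"}, {"\u00F6", "oe"}, {"\u00FC", "ue"},   // a-, o-, u-umlaut
        {"\u00C4", "Ae"}, {"\u00D6", "Oe"}, {"\u00DC", "Ue"},   // upper-case umlauts
        {"\u00DF", "ss"}                                        // sharp s
    };

    public static String asciifyGerman(String input) {
        // Compose first, so that "a" followed by a combining diaeresis
        // is matched the same way as the precomposed a-umlaut.
        String s = Normalizer.normalize(input, Normalizer.Form.NFC);
        for (String[] pair : DIGRAPHS) {
            s = s.replace(pair[0], pair[1]);
        }
        // Then decompose and strip the remaining diacritical marks as before.
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(asciifyGerman("M\u00E4dchen"));  // Maedchen
        System.out.println(asciifyGerman("caf\u00E9"));     // cafe
    }
}
```

Anything that is still above 127 after this step (for example "æ") can then go through the same UnicodeEscaper call as in asciify.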