java utf-8 windows-1255

Guessing the text encoding from a UTF-8 BOM file in Java 6


I'm getting text files in both Hebrew and Arabic, encoded as UTF-8 with a BOM. I need to convert them to Windows-1255 or Windows-1256, depending on the content.

How can I determine, at runtime, the correct target encoding to use?

No luck with Mozilla's UniversalDetector, nor with any other solution that I've found. Any ideas? (I need to do it with Java 6. Don't ask why...)


Solution

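  • As of Java 1.7 the Character class knows about Unicode scripts such as Arabic and Hebrew. (The stream-based snippet below additionally needs Java 8 for codePoints() and lambdas; a positive sum means mostly Arabic, a negative sum mostly Hebrew.)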

    int freqs = s.codePoints().map(cp ->
            Character.UnicodeScript.of(cp) == Character.UnicodeScript.ARABIC ? 1
            : Character.UnicodeScript.of(cp) == Character.UnicodeScript.HEBREW ? -1
            : 0).sum();
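
    On Java 7 itself, where streams and lambdas are not available, the same tally can be written as a plain loop. This is only a sketch: the class name ScriptTally and the method name tally are made up for illustration, and the sign convention (positive for Arabic, negative for Hebrew) mirrors the snippet above.

    ```java
    public class ScriptTally {

        /** Returns a positive value if Arabic letters dominate, a negative one if Hebrew letters do. */
        static int tally(String s) {
            int balance = 0;
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                i += Character.charCount(cp);
                // Character.UnicodeScript exists since Java 7.
                Character.UnicodeScript script = Character.UnicodeScript.of(cp);
                if (script == Character.UnicodeScript.ARABIC) {
                    ++balance;
                } else if (script == Character.UnicodeScript.HEBREW) {
                    --balance;
                }
            }
            return balance;
        }

        public static void main(String[] args) {
            // Unicode escapes keep the example independent of the source file encoding.
            System.out.println(tally("\u05D0\u05D1\u05D2"));   // three Hebrew letters
            System.out.println(tally("\u0635\u0636 \u05D0"));  // two Arabic letters, one Hebrew letter
        }
    }
    ```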
    

    For Java 1.6 the directionality might be sufficient, as Hebrew letters report RIGHT_TO_LEFT and Arabic letters report RIGHT_TO_LEFT_ARABIC:

        String s = "אבגדהאבגדהصضطظع"; // First Hebrew, then Arabic.
        int i0 = 0;
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            i += Character.charCount(codePoint);
            boolean rtl = Character.getDirectionality(codePoint)
                    == Character.DIRECTIONALITY_RIGHT_TO_LEFT;
            boolean rtl2 = Character.getDirectionality(codePoint)
                    == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC;
            System.out.printf("[%d - %d] '%s': RTL %s %s%n",
                    i0, i, s.substring(i0,  i), rtl, rtl2);
            i0 = i;
        }
    
    [0 - 1] 'א': RTL true false
    [1 - 2] 'ב': RTL true false
    [2 - 3] 'ג': RTL true false
    [3 - 4] 'ד': RTL true false
    [4 - 5] 'ה': RTL true false
    [5 - 6] 'א': RTL true false
    [6 - 7] 'ב': RTL true false
    [7 - 8] 'ג': RTL true false
    [8 - 9] 'ד': RTL true false
    [9 - 10] 'ה': RTL true false
    [10 - 11] 'ص': RTL false true
    [11 - 12] 'ض': RTL false true
    [12 - 13] 'ط': RTL false true
    [13 - 14] 'ظ': RTL false true
    [14 - 15] 'ع': RTL false true
    

    So, counting the characters of each kind:

    // Counts Arabic letters, stopping early once there is plenty of evidence.
    int arabic(String s) {
        int n = 0;
        for (char ch : s.toCharArray()) {
            if (Character.getDirectionality(ch)
                    == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                ++n;
                if (n > 1000) {
                    break;
                }
            }
        }
        return n;
    }

    // Counts Hebrew letters (plain right-to-left), with the same early exit.
    int hebrew(String s) {
        int n = 0;
        for (char ch : s.toCharArray()) {
            if (Character.getDirectionality(ch)
                    == Character.DIRECTIONALITY_RIGHT_TO_LEFT) {
                ++n;
                if (n > 1000) {
                    break;
                }
            }
        }
        return n;
    }
    
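    // Note: any Arabic letter at all wins here, even in a mostly Hebrew text.
    if (arabic(s) > 0) {
        return "Windows-1256";
    } else if (hebrew(s) > 0) {
        return "Windows-1255";
    } else {
        // Neither script found; the name below is a joke, not a real charset.
        return "Klingon-1257";
    }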
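
Putting the pieces together for the original question, a minimal Java 6 compatible sketch might look like the following. The class name RtlTranscoder and its method names are hypothetical. Unlike the snippet above, it compares the two counts rather than letting a single Arabic letter decide, and it returns null when no right-to-left letters are found; both are assumptions about what is wanted, not part of the answer above. It also assumes the JVM ships the windows-1255 and windows-1256 charsets, which full JDK/JRE installs normally do.

```java
import java.io.ByteArrayOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;

public class RtlTranscoder {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    /** Decodes UTF-8 bytes and drops a leading BOM (U+FEFF), which Java's decoder keeps. */
    static String decodeUtf8(byte[] bytes) {
        String text = new String(bytes, UTF8);
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    /** Returns "windows-1256", "windows-1255", or null if no right-to-left letters occur. */
    static String guessCharset(String text) {
        int arabic = 0;
        int hebrew = 0;
        for (int i = 0; i < text.length(); i++) {
            byte dir = Character.getDirectionality(text.charAt(i));
            if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT_ARABIC) {
                arabic++;
            } else if (dir == Character.DIRECTIONALITY_RIGHT_TO_LEFT) {
                hebrew++;
            }
        }
        if (arabic == 0 && hebrew == 0) {
            return null; // nothing to go on; the caller has to decide
        }
        return arabic > hebrew ? "windows-1256" : "windows-1255";
    }

    /** Reads a stream fully into memory (Java 6 has no Files.readAllBytes). */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: RtlTranscoder <utf8-input> <output>");
            return;
        }
        InputStream in = new FileInputStream(args[0]);
        String text;
        try {
            text = decodeUtf8(readAll(in));
        } finally {
            in.close();
        }
        String charset = guessCharset(text);
        if (charset == null) {
            System.err.println("No Hebrew or Arabic letters found; nothing written.");
            return;
        }
        OutputStream out = new FileOutputStream(args[1]);
        try {
            out.write(text.getBytes(Charset.forName(charset)));
        } finally {
            out.close();
        }
        System.out.println("Wrote " + args[1] + " as " + charset);
    }
}
```

Note that String.getBytes silently replaces characters the target code page cannot represent with '?', so a file mixing both scripts will lose the minority script in the output.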