java, utf-8

How do I convert UTF-8 in hex to its code point?


I have a String, e2 80 99, which is the hex representation of a UTF-8 encoded character. The string represents

U+2019  ’   e2 80 99    RIGHT SINGLE QUOTATION MARK

I want to convert e2 80 99 to its corresponding Unicode code point, U+2019, or even to the character itself, ’ (right single quotation mark).

How do I do it?


Solution

  • In short, decode the bytes as UTF-8 into a String, then read the code point at the start of that String. For most characters that is simply the first char. For characters outside the Basic Multilingual Plane, UTF-16 stores the character as two surrogate chars, which must be combined into one code point. This is a proof of concept:

    import java.nio.charset.StandardCharsets;

    public class Utf8HexToCodePoint {
        public static void main(String[] args) {

            // Parse the space-separated hex digits into the raw UTF-8 bytes.
            String utf8Hex = "e2 80 99";
            String[] hexBytes = utf8Hex.split(" ");
            byte[] rawBytes = new byte[hexBytes.length];
            int index = 0;
            for (String hexByte : hexBytes) {
                // parseInt returns 0..255; the cast narrows it to a signed byte.
                rawBytes[index++] = (byte) Integer.parseInt(hexByte, 16);
            }
            // Decode the bytes as UTF-8 (Java Strings are UTF-16 internally).
            String decoded = new String(rawBytes, StandardCharsets.UTF_8);

            // codePointAt combines a surrogate pair into a single code point,
            // so characters outside the BMP are handled as well.
            int codePoint = decoded.codePointAt(0);
            System.out.println("code point: U+" + Integer.toHexString(codePoint).toUpperCase());
        }
    }
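
If the input may hold more than one character, the same idea can be wrapped in a small helper. The sketch below is not part of the original answer: the class name `Utf8HexToCodePoints` and the method `toCodePoints` are hypothetical, and it assumes Java 8 or later (for `String.codePoints()`) and well-formed, whitespace-separated hex bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

public class Utf8HexToCodePoints {

    // Hypothetical helper: decodes whitespace-separated hex bytes as UTF-8
    // and formats every resulting code point as U+XXXX.
    public static String toCodePoints(String hexBytes) {
        String[] parts = hexBytes.trim().split("\\s+");
        byte[] bytes = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            // parseInt yields 0..255; the cast narrows it to a signed byte.
            bytes[i] = (byte) Integer.parseInt(parts[i], 16);
        }
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        // codePoints() yields one int per character, merging surrogate pairs.
        return decoded.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(toCodePoints("e2 80 99"));    // U+2019
        System.out.println(toCodePoints("f0 9f 98 80")); // U+1F600
    }
}
```

The second call shows a four-byte sequence that becomes a surrogate pair in the Java String but is still reported as a single code point. Note that malformed input is not rejected here: `new String(byte[], Charset)` silently replaces invalid sequences with U+FFFD.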