java character-encoding utf-16

Java: how to know if a character consists of two code units?


Java stores a string as a sequence of UTF-16 code units. A character whose code point lies above U+FFFF needs two code units, so it is split across two char values. Printing the following string shows this:

(Tested with a random character that needs two code units: https://codepoints.net/U+1230D)

String s = "\uD808\uDF0D";
System.out.println(s);
System.out.println(s.length());
System.out.println(s.charAt(0));
System.out.println(s.charAt(1));

Output:

𒌍
2
?
?

As expected, it prints garbage when only one half of the character is printed.

But how can I tell that a character consists of two code units?

I guess by a special bit set in the first or second part?

I think the explanation is somewhere in here (https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF) but I don't really understand it.


Solution

  • Avoid char

    The char type has been essentially broken since Java 2, legacy since Java 5. As a 16-bit value, char is physically incapable of representing most of the 154,998 characters defined in Unicode (as of Unicode 16.0).

    Do not use char. Do not call String#length. Do not call charAt.

    Code point

    Learn to use only code point integer numbers to work with individual characters.

    You will find code point related methods on various classes, such as String, StringBuilder, and Character.
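To answer the question as asked: yes, the two halves are recognizable by their value ranges. In UTF-16 the first half (high surrogate) falls in U+D800 to U+DBFF and the second half (low surrogate) in U+DC00 to U+DFFF. The sketch below is only an illustration; the class name `SurrogateDemo` and the helper `needsTwoChars` are made up for this example, while the `Character` and `String` methods are standard JDK calls. It touches `char` only to inspect the two halves.

```java
final class SurrogateDemo {

    // Hypothetical helper: true when the code point lies above U+FFFF
    // and therefore occupies two char values (a surrogate pair) in a String.
    static boolean needsTwoChars(int codePoint) {
        return Character.charCount(codePoint) == 2;
    }

    public static void main(String[] args) {
        String s = "\uD808\uDF0D"; // U+1230D, the character from the question

        // Read the whole code point instead of a single char.
        int codePoint = s.codePointAt(0);
        System.out.println(Integer.toHexString(codePoint)); // 1230d
        System.out.println(needsTwoChars(codePoint));       // true

        // Inspecting the individual halves, only to show what they are.
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true

        // Number of code points, as opposed to the number of char values.
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```

`Character.isSupplementaryCodePoint(codePoint)` gives the same answer as the helper, if a built-in check is preferred.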

    int[] codePoints = "😷".codePoints().toArray() ; // One character, one code point. But `String#length` reports `2`. So don’t call `String#length`.
    

    Be aware that what the human reader perceives as a single character may be composed of multiple code points. In this next example we use a COMBINING ACUTE ACCENT (code point: U+0301 hex, 769 decimal) as the fifth code point.

    String x = "cafe\u0301" ;
    

    Alternate syntax:

    String x = new StringBuilder( "cafe" ).appendCodePoint( 769 ).toString() ;
    

    The naïve reader perceives four characters, but the text is actually composed of five code points. Get the count of code points:

    long countCodePoints = x.codePoints().count() ; 
    

    See this code run at Ideone.com.

    café

    5

    See each of the code points.

    x.codePoints().forEach( System.out :: println ) ;
    

    When run:

    99
    97
    102
    101
    769
    

    See the name of each code point.

    x.codePoints ( ).mapToObj ( Character :: getName ).forEach ( System.out :: println );
    
    LATIN SMALL LETTER C
    LATIN SMALL LETTER A
    LATIN SMALL LETTER F
    LATIN SMALL LETTER E
    COMBINING ACUTE ACCENT
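If the goal is to count what the reader perceives rather than code points, one possible approach (an assumption on my part, not something the examples above rely on) is `java.text.BreakIterator`. The sketch below uses a made-up class `PerceivedCharacters` and helper `countPerceivedCharacters`. It treats each boundary reported by the character iterator as one perceived character, which works for a simple base letter plus combining mark; how well it copes with complex emoji sequences depends on the JDK version.

```java
import java.text.BreakIterator;

final class PerceivedCharacters {

    // Hypothetical helper: counts the segments between the boundaries
    // reported by the JDK's character BreakIterator.
    static int countPerceivedCharacters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        // next() returns the following boundary, or DONE at the end of the text.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String x = "cafe\u0301";
        System.out.println(x.codePoints().count());      // 5
        System.out.println(countPerceivedCharacters(x)); // 4
    }
}
```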