Java stores text as UTF-16 code units. So a character that needs two code units (a surrogate pair) is split across two char values. When I print the following string, this happens:
(Tested with an arbitrary character that needs two code units: https://codepoints.net/U+1230D)
String s = "\uD808\uDF0D";
System.out.println(s);
System.out.println(s.length());
System.out.println(s.charAt(0));
System.out.println(s.charAt(1));
Output:
𒌍
2
?
?
As expected, it prints garbage when printing only one half of the character.
But how can I tell that a character consists of two code units?
I guess by a special bit set in the first or second part?
I think the explanation is somewhere in here (https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF) but I don't really understand it.
char
The char type has been essentially broken since Java 2, legacy since Java 5. As a 16-bit value, char is physically incapable of representing most of the 154,998 characters defined in Unicode.
Do not use char. Do not call String#length. Do not call charAt.
Learn to use only code point integer numbers to work with individual characters.
You will find code point related methods on various classes, such as String, StringBuilder, and Character.
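As a rough illustration of those code point methods, here is a minimal sketch that walks a string one code point at a time and reports which code points need two UTF-16 code units. The class name CodePointDemo and its two helper methods are made up for this example. Under the hood, the first half of a surrogate pair lies in the range U+D800 to U+DBFF and the second half in U+DC00 to U+DFFF, which is what Character.isHighSurrogate and Character.isLowSurrogate test, but staying at the code point level avoids having to inspect the halves at all.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; the names here are illustrative only.
class CodePointDemo {

    // Splits text into one String per code point, so a surrogate pair stays intact.
    static List<String> splitByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
    }

    // Counts the code points that need two UTF-16 code units (a surrogate pair).
    static long countSupplementary(String s) {
        return s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();
    }

    public static void main(String[] args) {
        String s = "a\uD808\uDF0Db"; // 'a', U+1230D from the question, 'b'

        // Print each code point with the number of code units it occupies.
        s.codePoints().forEach(cp ->
                System.out.println(
                        "U+" + Integer.toHexString(cp).toUpperCase()
                        + " uses " + Character.charCount(cp) + " code unit(s)"));

        System.out.println(splitByCodePoint(s).size()); // 3 code points
        System.out.println(countSupplementary(s));      // 1 of them is a surrogate pair
    }
}
```

Character.charCount returns 1 for code points up to U+FFFF and 2 for anything above, so it answers the "one part or two parts" question directly.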
int[] codePoints = "😷".codePoints().toArray() ; // One character, one code point. But `String#length` reports `2`. So don’t call `String#length`.
Be aware that what the human reader perceives as a single character may be composed of multiple code points. In this next example we use a COMBINING ACUTE ACCENT (code point: U+0301 hex, 769 decimal) as the fifth code point.
String x = "cafe\u0301" ;
Alternate syntax:
String x = new StringBuilder( "cafe" ).appendCodePoint( 769 ).toString() ;
The naïve reader perceives four characters, but the text is actually composed of five code points. Get the count of code points:
long countCodePoints = x.codePoints().count() ;
See this code run at Ideone.com.
café
5
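If you need the count a human reader would perceive rather than the count of code points, one possible approach is java.text.BreakIterator, whose character instance groups a base letter with the combining marks that follow it. The sketch below is only an approximation under that assumption: the class name PerceivedCharacters and the method countPerceived are invented for this example, and results for more complex sequences (such as multi-part emoji) can vary by JDK version.

```java
import java.text.BreakIterator;

// Hypothetical helper class; the names here are illustrative only.
class PerceivedCharacters {

    // Counts the character boundaries reported by BreakIterator,
    // an approximation of what a reader perceives as characters.
    static int countPerceived(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String x = "cafe\u0301";
        System.out.println(x.codePoints().count()); // 5 code points
        System.out.println(countPerceived(x));      // 4 perceived characters
    }
}
```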
See each of the code points.
x.codePoints().forEach( System.out :: println ) ;
When run:
99
97
102
101
769
See the name of each code point.
x.codePoints ( ).mapToObj ( Character :: getName ).forEach ( System.out :: println );
LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E
COMBINING ACUTE ACCENT