When I execute the code:
char[] c1 = {'a','b'};
char[] c2 = Character.toChars(0x10437);
System.out.println(c1);
System.out.println(c2);
I get:
ab
𐐷
By reading the documentation of Character.toChars, I get that c2 is a char array of size 2, holding the surrogate pair that represents 0x10437 in UTF-16.
What I don't understand is this: c1 and c2 are both char arrays of length 2, so how does println know that c1 represents 2 characters ('a' and 'b') while c2 represents only 1 character (𐐷)? Why doesn't println recognize c2 as two characters?
The javadoc for Character.toChars(int) states (with my emphasis added):
Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.
So, as you point out, your statement char[] c2 = Character.toChars(0x10437); causes the high surrogate of that code point to be placed in c2[0], and the low surrogate of the code point to be placed in c2[1].
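To see the two cases the javadoc describes side by side, here is a minimal sketch (the class name ToCharsDemo is just for illustration) contrasting a BMP code point with a supplementary one:

```java
public class ToCharsDemo {
    public static void main(String[] args) {
        char[] bmp  = Character.toChars(0x0077);   // 'w' is in the BMP
        char[] supp = Character.toChars(0x10437);  // a supplementary code point

        // BMP code point: array of length 1, holding the code point itself
        System.out.println(bmp.length);   // 1
        // Supplementary code point: array of length 2, holding the surrogate pair
        System.out.println(supp.length);  // 2
    }
}
```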
You can display the hex values of that surrogate pair:
System.out.printf("High: 0x%X%n", (int)c2[0]);
System.out.printf("Low: 0x%X%n", (int)c2[1]);
This is the output from that code:
High: 0xD801
Low: 0xDC37
Now println() "knows" to treat those two array elements as a surrogate pair for rendering a single character because surrogate values fall in specific ranges: high surrogates are U+D800 to U+DBFF, and low surrogates are U+DC00 to U+DFFF. (Somewhat confusingly, high surrogate values are numerically lower than low surrogate values.) See section 3.8 Surrogates of the Unicode Standard for the rules, but this is the crucial point: "Isolated surrogate code units have no interpretation on their own". So when println() encounters the value 0xD801 it knows that it can only be a high surrogate (and nothing else), and that it must be followed by a low surrogate in the range U+DC00 to U+DFFF. It can then construct a code point from that surrogate pair and print the character represented by that code point.
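The same classification and recombination that println() performs internally is exposed directly by the Character class, so you can verify the reasoning yourself (SurrogateDemo is just an illustrative class name):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        char[] c2 = Character.toChars(0x10437);

        // Each half can be classified on its own, purely by its value range
        System.out.println(Character.isHighSurrogate(c2[0])); // true
        System.out.println(Character.isLowSurrogate(c2[1]));  // true

        // ...and the pair can be combined back into the original code point
        int cp = Character.toCodePoint(c2[0], c2[1]);
        System.out.printf("U+%X%n", cp); // U+10437
    }
}
```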
A lot of confusion arises from the multiple meanings of the word "character" in Java, Unicode, and plain English. Generally we think of a one-to-one mapping between a digital character value (i.e. a code point) and its visual representation, so the Unicode character U+0077 represents the visual character w and the Java character (char) 'w'. But in your example, the surrogate values U+D801 and U+DC37 do not represent individual characters at all; they are meaningful only when combined to form the code point U+10437, which is represented by the visual character 𐐷.
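This distinction between char units and characters-as-code-points also shows up in the String API, which offers both ways of counting (CodePointCountDemo is just an illustrative class name):

```java
public class CodePointCountDemo {
    public static void main(String[] args) {
        // A string containing the single character 𐐷
        String s = new String(Character.toChars(0x10437));

        // length() counts char units (UTF-16 code units)
        System.out.println(s.length());                      // 2

        // codePointCount() counts characters in the Unicode sense
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```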