My question arises from this answer, which says:
Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.
If that's correct, then why is "ℤ".getBytes(StandardCharsets.UTF_8).length == 3 and "ℤ".getBytes(StandardCharsets.UTF_16).length == 4?
It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8 and UTF-16).
0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of 'sequence numbers' mapped to certain characters. Such a sequence number is called a code point, and it's often written down as a hexadecimal number.
How that number is encoded is a separate matter, and the encoded form may take up more bytes than the raw code point value would suggest.
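A quick way to see the distinction in code (a small sketch in Java, since your question already uses getBytes; the class and helper names are just for illustration):

    import java.nio.charset.StandardCharsets;

    public class CodePointVsEncoding {
        public static void main(String[] args) {
            String z = "ℤ";
            // The code point: the abstract Unicode number, independent of any encoding.
            System.out.printf("code point: U+%04X%n", z.codePointAt(0));    // U+2124
            // The same code point encoded in two different encodings:
            printHex("UTF-8   ", z.getBytes(StandardCharsets.UTF_8));       // E2 84 A4 (3 bytes)
            printHex("UTF-16BE", z.getBytes(StandardCharsets.UTF_16BE));    // 21 24    (2 bytes)
        }

        // Helper: prints the bytes of an array as hexadecimal.
        private static void printHex(String label, byte[] bytes) {
            StringBuilder sb = new StringBuilder(label + ": ");
            for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
            System.out.println(sb.toString().trim());
        }
    }

UTF-16BE is used here to show the raw code unit without a byte order mark; the plain UTF-16 result is covered at the end of this answer.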
A short calculation of the UTF-8 encoding of this character:
To know which bytes belong to the same character, UTF-8 uses a system where the first byte of a multi-byte sequence starts with a certain number (let's call it N) of 1 bits followed by a 0 bit; N is the number of bytes the character takes up. The remaining N - 1 bytes each start with the bits 10. (A single-byte character, i.e. plain ASCII, simply starts with a 0 bit.)
0x2124 = 10000100100100 (binary)
According to the rules above, this converts to the following UTF-8 encoding:
11100010 10000100 10100100   <-- Our UTF-8 encoded result
AaaaBbDd CcDddddd CcDddddd   <-- Some notes, explained below
- A is a set of ones followed by a zero, which denotes the number of bytes belonging to this character (three 1s = three bytes).
- B is padding with two zeros, because otherwise the total number of bits would not be divisible by 8.
- C is the continuation bits (each subsequent byte starts with 10).
- D is the actual bits of our code point.

So A and C are the bits that denote how many bytes a code point takes up, while B and D form the actual data; the sketch below applies exactly these rules in code.
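Here is that bit-twiddling as a small sketch, assuming we only care about the three-byte range (U+0800 to U+FFFF) that ℤ falls into; the class name is arbitrary:

    public class ManualUtf8 {
        public static void main(String[] args) {
            int cp = 0x2124; // code point of 'ℤ'

            // Three-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx
            int b1 = 0b1110_0000 | (cp >> 12);          // A, then B's padding zeros and the top bits of D
            int b2 = 0b1000_0000 | ((cp >> 6) & 0x3F);  // C, then the next six bits of D
            int b3 = 0b1000_0000 | (cp & 0x3F);         // C, then the last six bits of D
            System.out.printf("%02X %02X %02X%n", b1, b2, b3);  // E2 84 A4

            // Cross-check against the library encoder:
            byte[] expected = "ℤ".getBytes(java.nio.charset.StandardCharsets.UTF_8);
            System.out.printf("%02X %02X %02X%n",
                    expected[0] & 0xFF, expected[1] & 0xFF, expected[2] & 0xFF);  // E2 84 A4
        }
    }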
So indeed, the character ℤ takes up three bytes in UTF-8. As for the four bytes you got from UTF-16: Java's StandardCharsets.UTF_16 encoder prepends a two-byte byte order mark (BOM) to the single two-byte code unit, giving 4; StandardCharsets.UTF_16BE or UTF_16LE, which write no BOM, would give 2.
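You can see the BOM directly with a throwaway sketch (again, the class name is arbitrary):

    import java.nio.charset.StandardCharsets;

    public class Utf16Bom {
        public static void main(String[] args) {
            String z = "ℤ";
            byte[] withBom    = z.getBytes(StandardCharsets.UTF_16);    // FE FF 21 24  (BOM + code unit)
            byte[] withoutBom = z.getBytes(StandardCharsets.UTF_16BE);  // 21 24        (code unit only)
            System.out.println(withBom.length + " vs " + withoutBom.length);  // 4 vs 2
        }
    }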