java unicode

If 'ℤ' is in the BMP, why isn't it encoded in 2 bytes?


My question arises from this answer, which says:

Since 'ℤ' (0x2124) is in the basic multilingual plane it is represented by a single code unit.

If that's correct, then why is "ℤ".getBytes(StandardCharsets.UTF_8).length == 3 and "ℤ".getBytes(StandardCharsets.UTF_16).length == 4?


Solution

  • It seems you're mixing up two things: the character set (Unicode) and its encodings (UTF-8 or UTF-16).

    0x2124 is only the 'sequence number' in the Unicode table. Unicode is nothing more than a bunch of 'sequence numbers' mapped to certain characters. Such a sequence number is called a code point, and it's often written down as a hexadecimal number.

    How that number is encoded is a separate matter, and the encoded form can take up more bytes than the raw code point would.
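A minimal Java sketch of that distinction, using only the JDK's standard charsets; the class and method names are made up for illustration. It also touches the UTF-16 half of the question: the "single code unit" in the quoted answer is a UTF-16 code unit (one Java char, two bytes), and Java's UTF_16 charset writes a two-byte byte order mark in front of the data when encoding, which would account for the 4. UTF_16BE writes no such mark.

```java
import java.nio.charset.StandardCharsets;

public class CodePointVsBytes {
    // "\u2124" is ℤ, written as an escape so the source file's own encoding does not matter.
    static final String Z = "\u2124";

    // Number of bytes the JDK produces for ℤ in each encoding.
    static int utf8Length()    { return Z.getBytes(StandardCharsets.UTF_8).length; }
    static int utf16Length()   { return Z.getBytes(StandardCharsets.UTF_16).length; }
    static int utf16beLength() { return Z.getBytes(StandardCharsets.UTF_16BE).length; }

    public static void main(String[] args) {
        // The code point is just a number; it says nothing about bytes yet.
        System.out.printf("code point: U+%04X%n", Z.codePointAt(0));
        // String.length() counts UTF-16 code units (chars), not bytes.
        System.out.println("UTF-16 code units: " + Z.length());
        System.out.println("UTF-8 bytes:       " + utf8Length());
        // UTF_16 prepends a byte order mark (FE FF) when encoding.
        System.out.println("UTF-16 bytes:      " + utf16Length());
        // UTF_16BE encodes the same code unit without a byte order mark.
        System.out.println("UTF-16BE bytes:    " + utf16beLength());
    }
}
```

So the code point stays 0x2124 throughout; only the byte count changes with the encoding chosen.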


    Short calculation of UTF-8 encoding of given character:
    To mark which bytes belong to the same character, UTF-8 uses a scheme where the first byte of a multi-byte character starts with a certain number (let's call it N) of 1 bits followed by a 0 bit. N is the number of bytes the character takes up. The remaining N − 1 bytes each start with the bits 10.

    2124hex = 10000100100100bin

    According to the rules above, this converts to the following UTF-8 encoding:

    11100010 10000100 10100100    <-- Our UTF-8 encoded result
    ^   ^ ^  ^ ^      ^ ^
    AaaaBbDd CcDddddd CcDddddd    <-- Some notes, explained below
    

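    So A and C are marker bits: A (1110) says the character takes up three bytes, and C (10) marks each continuation byte. B and D form the actual data: B is two zero bits of padding, and D holds the 14 significant bits of the code point.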
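The bit layout above can be reproduced by hand. The following is only a sketch of the three-byte case (code points U+0800 to U+FFFF), not a full UTF-8 encoder; `ManualUtf8`, `encodeThreeBytes` and `toBits` are hypothetical names, and surrogate code points are not rejected.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ManualUtf8 {
    // Packs a code point from U+0800..U+FFFF into the pattern
    // 1110xxxx 10xxxxxx 10xxxxxx (three-byte case only).
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("not a three-byte code point");
        }
        return new byte[] {
            (byte) (0xE0 | (codePoint >> 12)),         // 1110 marker + top 4 bits
            (byte) (0x80 | ((codePoint >> 6) & 0x3F)), // 10 marker + middle 6 bits
            (byte) (0x80 | (codePoint & 0x3F))         // 10 marker + low 6 bits
        };
    }

    // Renders one byte as eight binary digits, zero-padded on the left.
    static String toBits(byte b) {
        return String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0');
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0x2124);
        for (byte b : manual) {
            System.out.print(toBits(b) + " ");
        }
        System.out.println();
        // Compare against the JDK's own encoder.
        byte[] jdk = "\u2124".getBytes(StandardCharsets.UTF_8);
        System.out.println("matches JDK: " + Arrays.equals(manual, jdk));
    }
}
```

The shifts by 12 and 6 split the 16 payload bits into groups of 4, 6 and 6, which is where the two padding bits in the first byte come from.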

    So indeed, the character ℤ takes up three bytes in UTF-8.
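The leading-bit rule can also be read in the other direction: given the first byte of an encoded character, count its leading 1 bits to learn how many bytes the character spans. A small sketch, assuming well-formed input; `LeadByte` and `sequenceLength` are illustrative names, not part of any library.

```java
import java.nio.charset.StandardCharsets;

public class LeadByte {
    // Returns how many bytes a UTF-8 sequence occupies, judged by its first byte.
    static int sequenceLength(byte lead) {
        // Move the byte into the top 8 bits of an int, invert it, and count
        // leading zeros: that equals the number of leading 1 bits in the byte.
        int ones = Integer.numberOfLeadingZeros(~(lead << 24));
        if (ones == 0) {
            return 1; // 0xxxxxxx: a single-byte (ASCII) character
        }
        if (ones >= 2 && ones <= 4) {
            return ones; // 110xxxxx, 1110xxxx or 11110xxx
        }
        // A single leading 1 means a continuation byte; 5 or more is not valid UTF-8.
        throw new IllegalArgumentException("not a valid lead byte");
    }

    public static void main(String[] args) {
        byte[] encoded = "\u2124".getBytes(StandardCharsets.UTF_8);
        System.out.println("bytes announced by lead byte: " + sequenceLength(encoded[0]));
        System.out.println("bytes actually produced:      " + encoded.length);
    }
}
```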