I'm reading UTF-8 Encoding, and I don't understand the following sentence.
For characters equal to or below 2047 (hex 0x07FF), the UTF-8 representation is spread across two bytes. The first byte will have the two high bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).
If I'm not mistaken, this means UTF-8 requires two bytes to represent 2048 characters. In other words, we need to choose 2048 candidates from 2^16 options to represent each character.
Based on the byte ranges given (0xC2-0xDF and 0x80-0xBF), I calculated that there are 30 * 64 = 1920 possible combinations. However, the text also mentions 2048 characters.
How does 1920 numbers accommodate 2048 combinations? Is there a mistake in my calculation or in the text? Or is there something I'm misunderstanding about how these byte ranges relate to the number of representable characters?
Think of it as a container, where the encoding reserves a few bits for its own synchronization, and you get to use the remaining bits.
So for the range in question, the encoding "template" is
110 abcde 10 fghijk
(where I have left a single space to mark the boundary between the template and the value from the code point we want to encode, and two spaces between the actual bytes)
and you get to use the 11 bits abcdefghijk
for the value you actually want to transmit.
So for the code point U+07EB you get
0x07 00000111
0xEB 11101011
where the top five zero bits are masked out (remember, we only get 11 -- because the maximum value that the encoding can accommodate in two bytes is 0x07FF. If you have a larger value, the encoding will use a different template, which is three bytes) and so
0x07 = _____ 111 (template: _____ abc)
0xEB = 11 101011 (template: de fghijk)
abc de = 111 11 (where the first three come from 0x07, and the next two from 0xEB)
fghijk = 101011 (the remaining bits from 0xEB)
yielding the value
110 11111 10 101011
aka 0xDF 0xAB.
Wikipedia's article on UTF-8 contains more examples with nicely colored numbers to see what comes from where.
The range 0x00-0x7F, which can be represented in a single byte, contains 128 code points; the two-byte range thus needs to accommodate 1920 = 2048-128 code points.
The raw encoding would allow values in the range 0xC0-0xBF in the first byte, but the values 0xC0 and 0xC1 are not ever needed because those would represent code points which can be represented in a single byte, and thus are invalid as per the encoding spec. In other words, the 0x02 in 0xC2 comes from the fact that at least one bit in the high four bits out of the 11 that this segment of the encoding can represent (one of abcd
) needs to be a one bit in order for the value to require two bytes.