Well, I'm starting to study Unicode and have a few doubts. Right now I'm learning what a plane is: I saw that a plane is a set of 2^16 code points, and that UTF-16 supports 17 planes, numbered from 0 to 16. My question is the following: if UTF-16 can use up to 32 bits, why does it in practice only encode about 2^20 code points? Where does the 20 come from? I know that if a code point requires more than 2 bytes, UTF-16 uses two 16-bit units, but how does that fit into all of this? In short, where does this 2^20 come from, and why not 2^32? Thanks :)
Have a look at how surrogate pairs encode a character U >= 0x10000:
U' = yyyyyyyyyyxxxxxxxxxx // U - 0x10000
W1 = 110110yyyyyyyyyy // 0xD800 + yyyyyyyyyy
W2 = 110111xxxxxxxxxx // 0xDC00 + xxxxxxxxxx
(source)
As you can see, from the 32 bits of the 2x16 surrogate pair, 2x6 = 12 bits are used "only" to convey the information that this is indeed a surrogate pair (and not simply two characters with a value < 0x10000). This leaves you with 32 - 12 = 20 bits to store U'.
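To make the bit layout above concrete, here is a minimal Python sketch of the same scheme. The function names `to_surrogate_pair` and `from_surrogate_pair` are made up for illustration (they are not part of any library); the arithmetic just follows the U' / W1 / W2 definitions shown above.

```python
def to_surrogate_pair(cp):
    """Split a code point U >= 0x10000 into (W1, W2) as described above."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    u = cp - 0x10000              # U': a 20-bit value
    w1 = 0xD800 | (u >> 10)       # 110110 + top 10 bits (high surrogate)
    w2 = 0xDC00 | (u & 0x3FF)     # 110111 + low 10 bits (low surrogate)
    return w1, w2


def from_surrogate_pair(w1, w2):
    """Inverse operation: drop the two 6-bit markers and rebuild U."""
    return 0x10000 + (((w1 & 0x3FF) << 10) | (w2 & 0x3FF))


w1, w2 = to_surrogate_pair(0x1F600)
print(hex(w1), hex(w2))  # 0xd83d 0xde00
```

Only 10 payload bits survive in each 16-bit unit, which is where the 10 + 10 = 20 comes from. The result can be cross-checked against Python's built-in codec with `"\U0001F600".encode("utf-16-be")`.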
(Technically, you additionally have the values U < 0x10000, which UTF-16 encodes directly in a single 16-bit unit, except for the 2048 values from 0xD800 to 0xDFFF that are reserved for high and low surrogates. This means you end up slightly above 2^20 code points that can be encoded by UTF-16, but still well below 2^21: the highest code point supported by UTF-16 is U+10FFFF, not 2^20 = 0x100000.)
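If you want to see the actual numbers behind "slightly above 2^20", the count can be worked out in a few lines of Python. This is only a back-of-the-envelope sketch; the variable names are mine.

```python
PAIRS = 2**10 * 2**10              # 10 payload bits per surrogate, so 2^20 pairs
BMP = 2**16                        # values that fit in a single 16-bit unit
SURROGATES = 0xDFFF - 0xD800 + 1   # 2048 values reserved as surrogates

total_code_points = BMP + PAIRS    # the range U+0000 .. U+10FFFF
encodable = BMP - SURROGATES + PAIRS

print(total_code_points, hex(total_code_points - 1))  # 1114112 0x10ffff
print(encodable)                                      # 1112064
print(2**20 < encodable < 2**21)                      # True
```

Note that 1114112 is exactly 17 * 2^16, which matches the 17 planes mentioned in the question: plane 0 is the single-unit range and planes 1 to 16 are the 2^20 code points reached through surrogate pairs.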