Tags: utf-8, utf-16, utf, unicode-normalization, utf-16le

Why does UTF-16 only support 2^20 code points?


Well, I'm just starting to study Unicode, and I have a few doubts. At the moment I'm learning what a plane is: I saw that a plane is a set of 2^16 code points, and that the UTF-16 encoding supports 17 planes, numbered 0 to 16. My question is the following: if a UTF-16 surrogate pair takes up 32 bits, why in practice does it only encode 2^20 code points? Where does the 20 come from? I know that if a code point requires more than 2 bytes, UTF-16 uses two 16-bit units, but how does that fit into all of this? So the final question is: where does this 2^20 come from, and not 2^32? Thanks, :)


Solution

  • Have a look at how surrogate pairs encode a character U >= 0x10000:

    U' = yyyyyyyyyyxxxxxxxxxx  // U - 0x10000
    W1 = 110110yyyyyyyyyy      // 0xD800 + yyyyyyyyyy
    W2 = 110111xxxxxxxxxx      // 0xDC00 + xxxxxxxxxx
    

    (source: RFC 2781, Section 2.1)

    As you can see, of the 32 bits in the 2x16-bit surrogate pair, 2x6 = 12 bits are used "only" to convey the information that this is indeed a surrogate pair (and not simply two characters with values < 0x10000). This leaves you with 32 - 12 = 20 bits to store U'. (A runnable sketch of this encoding follows after the note below.)

    (Technically, you can additionally encode the values U < 0x10000 directly as a single 16-bit unit, minus the 2048 values 0xD800 to 0xDFFF that are reserved for the high and low surrogates themselves. That gives 2^16 - 2048 + 2^20 = 1,112,064 codepoints which can be encoded by UTF-16: slightly above 2^20, but still well below 2^21. Accordingly, the highest codepoint supported by UTF-16 is U+10FFFF, not 2^20 = 0x100000. The second sketch below spells out this arithmetic.)
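
    Here is a minimal Python sketch of the bit manipulation above (the function names encode_surrogate_pair and decode_surrogate_pair are made up for illustration, not from any standard library):

    def encode_surrogate_pair(u: int) -> tuple[int, int]:
        """Split a code point U >= 0x10000 into W1 (high) and W2 (low)."""
        assert 0x10000 <= u <= 0x10FFFF
        u_prime = u - 0x10000            # U' fits in 20 bits
        w1 = 0xD800 + (u_prime >> 10)    # top 10 bits of U'
        w2 = 0xDC00 + (u_prime & 0x3FF)  # bottom 10 bits of U'
        return w1, w2

    def decode_surrogate_pair(w1: int, w2: int) -> int:
        """Recombine a surrogate pair into the original code point."""
        assert 0xD800 <= w1 <= 0xDBFF and 0xDC00 <= w2 <= 0xDFFF
        return 0x10000 + ((w1 - 0xD800) << 10) + (w2 - 0xDC00)

    # Round trip for U+1F600 (the grinning-face emoji):
    w1, w2 = encode_surrogate_pair(0x1F600)
    assert (w1, w2) == (0xD83D, 0xDE00)
    assert decode_surrogate_pair(w1, w2) == 0x1F600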
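
    And, purely to double-check the counting from the note, a few lines of Python arithmetic:

    bmp = 2**16                   # values encodable in one 16-bit unit
    surrogates = 0xE000 - 0xD800  # 2048 values reserved as surrogate halves
    supplementary = 2**20         # values reachable via surrogate pairs

    total = bmp - surrogates + supplementary
    assert total == 1_112_064     # slightly above 2**20 = 1_048_576 ...
    assert total < 2**21          # ... but still well below 2**21 = 2_097_152
    assert 0x10000 + supplementary - 1 == 0x10FFFF  # highest code point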