unicode, utf-16

How does UTF-16 encoding use surrogate code points?


According to the Unicode specification

D91 UTF-16 encoding form: The Unicode encoding form that assigns each Unicode scalar value in the ranges U+0000..U+D7FF and U+E000..U+FFFF to a single unsigned 16-bit code unit with the same numeric value as the Unicode scalar value, and that assigns each Unicode scalar value in the range U+10000..U+10FFFF to a surrogate pair.

The term "scalar value" is referred to unicode code points, that is the range of abstract ideas which must be encoded into specific byte sequence by different encoding forms (UTF-16 and so on). So it seems that this excerpt gist is in view of not all code points can be accommodated into one UTF-16 code unit (two bytes), there are ones which should be encoded into a pair of code units - 4 bytes (it's called "a surrogate pair").
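
For example, here is a quick check in Python (only a sketch to illustrate that reading; the specific code points are arbitrary):

    # U+0041 "A" is a scalar value below U+FFFF: one 16-bit code unit (2 bytes).
    print(len("A".encode("utf-16-le")))           # 2

    # U+1F600 is above U+FFFF: a surrogate pair, two code units (4 bytes).
    print(len("\U0001F600".encode("utf-16-le")))  # 4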

However, the very term "scalar value" is defined as follows:

D76 Unicode scalar value: Any Unicode code point except high-surrogate and low-surrogate code points.

Wait... Does Unicode really have surrogate code points? Why are they needed when UTF-16 can use 4 bytes to represent a scalar value? Can anyone explain the rationale and how UTF-16 uses these code points?


Solution

  • Yes, Unicode reserves ranges for surrogate code points: U+D800..U+DBFF for high surrogates and U+DC00..U+DFFF for low surrogates.

    Unicode reserves these ranges because those 16-bit values are used only as halves of surrogate pairs, so no character can ever be assigned to them. A surrogate pair is two 16-bit code units that together encode a code point above U+FFFF, which does not fit into a single 16-bit code unit.
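
    As a sketch of the arithmetic behind a surrogate pair (the helper names encode_surrogate_pair and decode_surrogate_pair are only illustrative, not any standard API), in Python:

        def encode_surrogate_pair(code_point: int) -> tuple[int, int]:
            """Split a scalar value above U+FFFF into a (high, low) surrogate pair."""
            assert 0x10000 <= code_point <= 0x10FFFF
            offset = code_point - 0x10000         # 20-bit value in 0..0xFFFFF
            high = 0xD800 + (offset >> 10)        # top 10 bits    -> U+D800..U+DBFF
            low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> U+DC00..U+DFFF
            return high, low

        def decode_surrogate_pair(high: int, low: int) -> int:
            """Recombine a (high, low) surrogate pair into the original scalar value."""
            return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

        # U+1F600 (grinning face) encodes as the pair 0xD83D, 0xDE00.
        high, low = encode_surrogate_pair(0x1F600)
        print(hex(high), hex(low))                    # 0xd83d 0xde00
        print(hex(decode_surrogate_pair(high, low)))  # 0x1f600

    Because the high and low surrogate ranges never appear on their own as characters, a decoder reading any 16-bit code unit can always tell whether it is a complete character or the first or second half of a pair. That is also why U+D800..U+DFFF are excluded from the definition of "scalar value" in D76: they have no meaning outside a surrogate pair.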