
Is this Google Closure UTF-8 string valid?


The Google Closure UTF-8-to-byte-array tests include the string

\u0000\u007F\u0080\u07FF\u0800\uFFFF

which is supposed to be converted to the array

[0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF, 0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF]

I've tried a few other JavaScript and TypeScript UTF-8-to-byte array implementations and they claim that the UTF-8 string is invalid.

The string appears to cover the boundary code points at which the UTF-8 encoding goes from 1 byte to 2 bytes to 3 bytes.

Is Google correct, or are the other libraries?


Solution

  • Google is correct.

    The string '\u0000\u007F\u0080\u07FF\u0800\uFFFF' represents Unicode codepoints U+0000 U+007F U+0080 U+07FF U+0800 U+FFFF.

    The literal translation of those codepoints to UTF-8 is indeed bytes 00 7F C2 80 DF BF E0 A0 80 EF BF BF, just as Google says.
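The byte values can be checked by hand. The following is a minimal sketch, assuming a runtime that provides the standard `TextEncoder` (modern browsers, Node.js). The helper `encodeBmpCodePoint` and the formatter `toHex` are hypothetical names introduced for illustration; the helper only handles non-surrogate code points up to U+FFFF, which is all this string needs.

```javascript
// Hypothetical helper: UTF-8 encode one code point in the range U+0000..U+FFFF.
// Surrogates and code points above U+FFFF are deliberately not handled here.
function encodeBmpCodePoint(cp) {
  if (cp <= 0x7f) {
    return [cp]; // 1 byte: 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 2 bytes: 110xxxxx 10xxxxxx
  }
  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
  return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
}

// Hypothetical formatter: byte array to space-separated uppercase hex.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, '0')).join(' ');

const input = '\u0000\u007F\u0080\u07FF\u0800\uFFFF';

// Encode each code point manually.
const manual = [];
for (const ch of input) {
  manual.push(...encodeBmpCodePoint(ch.codePointAt(0)));
}

// Encode the same string with the platform's built-in encoder.
const builtIn = Array.from(new TextEncoder().encode(input));

console.log(toHex(manual));  // 00 7F C2 80 DF BF E0 A0 80 EF BF BF
console.log(toHex(builtIn)); // 00 7F C2 80 DF BF E0 A0 80 EF BF BF
```

Both paths produce the array from the Closure test, including `EF BF BF` for U+FFFF.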

    Note that U+FFFF is a noncharacter code point. Per the Unicode FAQ on noncharacters:

    A "noncharacter" is a code point that is permanently reserved in the Unicode Standard for internal use

    ...

    In Unicode 1.0 the code points U+FFFE and U+FFFF were annotated in the code charts as "Not character codes" and instead of having actual names were labeled "NOT A CHARACTER". The term "noncharacter" in later versions of the standard evolved from those early annotations and labels.

    In particular:

    Q: Are noncharacters intended for interchange?

    A: No. They are intended explicitly for internal use. For example, they might be used internally as a particular kind of object placeholder in a string. Or they might be used in a collation tailoring as a target for a weighting that comes between weights for "real" characters of different scripts, thus simplifying the support of "alphabetic index" implementations.

    Q: Are noncharacters prohibited in interchange?

    A: This question has led to some controversy, because the Unicode Standard has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of "noncharacter" in the standard has always indicated that noncharacters "should never be interchanged." That led some people to assume that the definition actually meant "shall not be interchanged" and that therefore the presence of a noncharacter in any Unicode string immediately rendered that string malformed according to the standard. But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of "interchange", so that they can be processed as intended. The choice of the word "should" in the original definition was deliberate, and indicated that one should not try to interchange noncharacters precisely because their interpretation is strictly internal to whatever implementation uses them, so they have no publicly interchangeable semantics. But other informative wording in the text of the core specification and in the character names list was differently and more strongly worded, leading to contradictory interpretations.

    Given this ambiguity of intent, in 2013 the UTC issued Corrigendum #9, which deleted the phrase "and that should never be interchanged" from the definition of noncharacters, to make it clear that prohibition from interchange is not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification for Unicode 7.0.

    Q: Are noncharacters invalid in Unicode strings and UTFs?

    A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.
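The last point can be observed with a strict decoder. This is a sketch under the assumption that the standard `TextEncoder` and `TextDecoder` are available; `roundTrips` and `isWellFormed` are hypothetical helper names, and the list of noncharacters is only a small sample. A `TextDecoder` created with `fatal: true` throws on ill-formed input, so it is a reasonable way to tell "noncharacter" apart from "invalid UTF-8".

```javascript
const encoder = new TextEncoder();

// Hypothetical helper: true if the bytes decode without error in strict mode.
function isWellFormed(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(new Uint8Array(bytes));
    return true;
  } catch {
    return false; // fatal mode throws a TypeError on ill-formed sequences
  }
}

// Hypothetical helper: encode a code point to UTF-8 and strictly decode it back.
function roundTrips(codePoint) {
  const text = String.fromCodePoint(codePoint);
  const bytes = encoder.encode(text);
  const decoded = new TextDecoder('utf-8', { fatal: true }).decode(bytes);
  return decoded === text;
}

// A sample of noncharacter code points.
const noncharacters = [0xfdd0, 0xfffe, 0xffff, 0x1ffff, 0x10ffff];
const results = noncharacters.map(roundTrips);
console.log(results); // [ true, true, true, true, true ]

// The UTF-8 form of U+FFFF is well-formed...
console.log(isWellFormed([0xef, 0xbf, 0xbf])); // true
// ...whereas an overlong encoding of U+0000 is genuinely ill-formed.
console.log(isWellFormed([0xc0, 0x80])); // false
```

A library that rejects `EF BF BF` is therefore applying a stricter policy than the UTF-8 well-formedness rules require.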