Tags: javascript, browser, character-encoding

Why does Wikipedia claim UTF-16 is obsolete when Javascript uses it?


The Wikipedia page for UTF-16 claims that it is obsolete, saying:

UTF-16 is the only encoding (still) allowed on the web that is incompatible with 8-bit ASCII. However it has never gained popularity on the web, where it is declared by under 0.004% of public web pages (and even then, the web pages are most likely also using UTF-8). UTF-8, by comparison, gained dominance years ago and accounted for 99% of all web pages by 2025. The Web Hypertext Application Technology Working Group (WHATWG) considers UTF-8 "the mandatory encoding for all [text]" and that for security reasons browser applications should not use UTF-16.

But browsers support Javascript natively, nearly every interactive webpage uses it, and Javascript strings are UTF-16. What's going on here?


Solution

  • You are conflating UTF-16 as a transmission encoding with UTF-16 as an in-memory string representation. What the quoted section of the Wikipedia page says is true: the vast majority of web pages (their HTML, CSS, and JS code) are transmitted with UTF-8 encoding. Often this is set explicitly in the Content-Type header; other times it is specified in a meta tag; in the absence of both indicators, textual content can generally be assumed to be UTF-8 encoded. These same mechanisms could also declare UTF-16 encoded content served to a client, but there is very little point in doing so. UTF-8 and UTF-16 are both capable of representing any Unicode code point, and UTF-8 uses less space for the kinds of characters you're likely to find in HTML/CSS/JS.
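
    For concreteness, here is a minimal Node.js sketch (the port, markup, and handler are placeholders, not anything from the question) of what declaring the transmission encoding looks like: the charset parameter on the Content-Type header, echoed by the meta tag, is what tells the browser how to decode the bytes it receives.

        import { createServer } from "node:http";

        createServer((req, res) => {
          // The charset describes the bytes on the wire, not how the JS engine
          // will later store the text in memory.
          res.setHeader("Content-Type", "text/html; charset=utf-8");
          res.end('<!doctype html><meta charset="utf-8"><p>héllo</p>'); // written out as UTF-8 bytes
        }).listen(8080);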

    String representation in Javascript, or any language for that matter, is a completely different subject. The way the runtime stores strings in memory is technically an implementation detail, and reading a string from an external source into the runtime is always encoding-aware. If you have some buffer that you want to turn into a Javascript string, you need to know or infer the text encoding used by that buffer. That is to say, there is never a case where the literal bytes of textual content served by an HTTP server are wrapped directly into a Javascript string. Re-encoding happens constantly, and it is the responsible thing to do when working with text from various sources.
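
    As a rough illustration of that encoding-aware step, TextDecoder (available in browsers and in Node) only turns bytes into a string once an encoding has been chosen; the byte values below are just a hand-picked example.

        // The same four bytes, decoded under two different assumptions.
        const bytes = new Uint8Array([0x68, 0x00, 0x69, 0x00]); // "hi" encoded as UTF-16LE

        new TextDecoder("utf-16le").decode(bytes); // "hi"
        new TextDecoder("utf-8").decode(bytes);    // "h\u0000i\u0000" (a wrong guess gives a different string)

        // Either way, the resulting string is stored however the engine likes;
        // the spec only requires it to behave as a sequence of 16-bit code units.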

    So really, at this point the question is "Why does Javascript use 16-bit characters?" Well, there's some history to acknowledge. To read from the same Wikipedia page:

    The encoding is variable-length as code points are encoded with one or two 16-bit code units. UTF-16 arose from an earlier obsolete fixed-width 16-bit encoding now known as UCS-2 (for 2-byte Universal Character Set), once it became clear that more than 2¹⁶ (65,536) code points were needed, including most emoji and important CJK characters such as for personal and place names.

    Imagine, if you will, that you are designing the character type for a newly created programming language, and there is a UCS-2 encoding scheme which can encode every character in 16 bits. Using a 16-bit character type would make a lot of sense. It would make the length property intuitive, since the number of code points and the number of code units would always be equal. It would greatly simplify text processing algorithms and ensure that each element corresponds to a single printed character. For this reason, many, many languages adopted 16-bit characters. However, of course, UCS-2 turned out not to be sufficient to encode every character. Retrofitting UTF-16 into these 16-bit character systems re-introduced issues that had previously been solved: UTF-16 surrogate pairs break the linear relationship between code points and code units, so "😎".length == 2.
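
    For example, in any modern engine the breakage looks like this:

        const face = "😎";                // U+1F60E, outside the 16-bit (BMP) range

        face.length;                       // 2 (counts UTF-16 code units, not characters)
        face.charCodeAt(0).toString(16);   // "d83d" (high surrogate)
        face.charCodeAt(1).toString(16);   // "de0e" (low surrogate)
        [...face].length;                  // 1 (string iteration walks code points)
        face.codePointAt(0).toString(16);  // "1f60e" (the actual code point)
        face.slice(0, 1);                  // a lone surrogate, not a printable character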

    Ultimately, the simple fact that 16-bit characters were widely adopted means the design has staying power. Developers are already comfortable with the concept and may or may not be aware of the evils of surrogate pairs. The subset of Unicode that can be represented with a single UTF-16 code unit is quite large, especially in comparison to UTF-8, which constantly emits multi-byte sequences for relatively common characters. That's fine for transmission, as stated at the start, but for in-memory text processing it has proven cumbersome.
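
    To make that size trade-off concrete, here is a small comparison; TextEncoder only emits UTF-8, so the UTF-16 side is simply String.prototype.length (code units).

        const utf8Bytes = (s) => new TextEncoder().encode(s).length;

        // Characters inside the Basic Multilingual Plane: always one UTF-16
        // code unit, but one to three UTF-8 bytes.
        utf8Bytes("a");  // 1 byte   ("a".length  === 1)
        utf8Bytes("é");  // 2 bytes  ("é".length  === 1)
        utf8Bytes("語"); // 3 bytes  ("語".length === 1)

        // Only astral characters need a surrogate pair (and four UTF-8 bytes).
        utf8Bytes("😎"); // 4 bytes  ("😎".length === 2)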