javascript json unicode xojo

How do JSON parsers encode unicode characters not in the basic multilingual plane?


I'm writing a JSON parser in Xojo. It works, except that I can't figure out how to encode and decode Unicode characters that are outside the Basic Multilingual Plane (BMP). In other words, my parser dies if it encounters a code point above U+FFFF.

The specs say:

To escape a code point that is not in the Basic Multilingual Plane, the character may be represented as a twelve-character sequence, encoding the UTF-16 surrogate pair corresponding to the code point. So for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E". However, whether a processor of JSON texts interprets such a surrogate pair as a single code point or as an explicit surrogate pair is a semantic decision that is determined by the specific processor.

What I don't understand is the algorithm for getting from U+1D11E to \uD834\uDD1E. I can't find any explanation of how to "encode the UTF-16 surrogate pair corresponding to the code point".

For example, say I want to encode the smiley face character (U+1F600). What would this be as a UTF-16 surrogate pair, and how is it derived?

Could somebody please at least point me in the correct direction?


Solution

  • Taken from the Wikipedia article linked by Remy Lebeau in the comments above:

    To encode U+10437 (𐐷) to UTF-16:

    Subtract 0x10000 from the code point, leaving 0x0437. For the high surrogate, shift right by 10 (divide by 0x400), then add 0xD800, resulting in 0x0001 + 0xD800 = 0xD801. For the low surrogate, take the low 10 bits (remainder of dividing by 0x400), then add 0xDC00, resulting in 0x0037 + 0xDC00 = 0xDC37.

    To decode U+10437 (𐐷) from UTF-16:

    Take the high surrogate (0xD801) and subtract 0xD800, then multiply by 0x400, resulting in 0x0001 × 0x400 = 0x0400. Take the low surrogate (0xDC37) and subtract 0xDC00, resulting in 0x37. Add these two results together (0x0437), and finally add 0x10000 to get the final decoded UTF-32 code point, 0x10437.
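
The steps quoted above can be sketched in JavaScript as follows. This is only an illustration: the function names `toSurrogatePair`, `fromSurrogatePair`, and `toJsonEscape` are made up for this example and do not come from any library, and the range checks are an assumption about how a parser might want to reject bad input. The arithmetic itself is language-independent, so it should carry over to a Xojo implementation.

```javascript
// Encode a code point above U+FFFF as a [high, low] UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point must be in the range U+10000..U+10FFFF");
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

// Decode a [high, low] surrogate pair back into a single code point.
function fromSurrogatePair(high, low) {
  if (high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError("not a valid surrogate pair");
  }
  return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
}

// Format a code point as JSON \uXXXX escape text (six or twelve characters).
function toJsonEscape(codePoint) {
  const hex = (unit) => "\\u" + unit.toString(16).toUpperCase().padStart(4, "0");
  if (codePoint < 0x10000) {
    return hex(codePoint);
  }
  return toSurrogatePair(codePoint).map(hex).join("");
}
```

For the smiley face in the question: 0x1F600 - 0x10000 = 0xF600. Shifting right by 10 gives 0x3D, so the high surrogate is 0xD800 + 0x3D = 0xD83D. The low 10 bits are 0x200, so the low surrogate is 0xDC00 + 0x200 = 0xDE00. The escaped form is therefore "\uD83D\uDE00", which is what `toJsonEscape(0x1F600)` produces. When decoding, a parser that reads a \uXXXX escape in the range 0xD800 to 0xDBFF would expect a second escape in the range 0xDC00 to 0xDFFF to follow immediately and combine the two with `fromSurrogatePair`.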