javascriptunicodenon-englishastral-plane

How to escape a character out of Basic Multilingual Plane?


For characters in Basic Multilingual Plane, we can use '\uxxxx' escape it. For example, you can use /[\u4e00-\u9fff]/ to match a common chinese character(0x4e00-0x9fff is the range of CJK Unified Ideographs).

But for characters out of Basic Multilingual Plane, their codes are bigger than 0xffff. So you can't use format '\uxxxx' to escape it, because '\u20000' means character '\u2000' and character '0', not a character which code is 0x20000.

How can I escape characters out of Basic Multilingual Plane? Use those characters directly is not a good idea, because they can't show in most fonts.


Solution

  • You can use a pair of escaped surrogate code points, as described in @duskwuff’s answer. You can use my Full Unicode input utility to get the notations (button “Show \u”), or use the Fileformat.info character search to find them out (item “C/C++/Java source code”, because JavaScript uses the same notation here).

    Alternatively, you can enter the characters directly: “You can enter non-BMP characters as such into string literals in your JavaScript code,whether in a separate file or as embedded in HTML. Naturally, you need suitable Unicode support in the editor you use. But JavaScript implementations need not support non-BMP characters in program source. They may, and modern browser implementations generally do.” (Going Global with JavaScript and Globalize.js, p. 177) There are some caveats like properly declaring the character encoding.

    Font support is a different issue, but when working with characters, you generally want to see them at some point anyway, at least in testing. So you more or less need some font(s) that cover the characters. The Fileformat.info pages also contain links to browser support info, such as (U+20000) Font Support – a good starting point, though not quite complete. For example, U+20000 '𠀀' is also supported in SimSun-ExtB