Problem description:
Mathematica uses "\:nnnn" as the syntax for Unicode input. E.g., if we enter "\:6c34", we get "水" ("water" in Chinese). But what if one wants to enter "\:1f618" (a face throwing a kiss)? When I tried this, I got "ὡ8", not a face throwing a kiss. So Mathematica evaluates "\:1f61" before I have entered the "8".
Question: How can we delay this evaluation, or more generally, how can we enter any Unicode character (that is, one whose hexadecimal code point has more than 4 digits)?
Software and hardware platform: I am running Mathematica 8 on an Intel Mac. I tried both the command-line version of Mathematica and a Mathematica notebook; they behave the same.
Thank you.
Reflections: Unicode is an extensible standard, and it can grow (and it does grow :)). A software system that implements the standard may implement only a subset of it (for example, an 8-bit, 16-bit or 32-bit encoding) and still be valid and useful. As the user of a software package, one should not assume that because the software says it supports Unicode, it supports the whole of Unicode.
Short answer: You can't do this because Mathematica doesn't support these characters properly. See the end of the post for some workarounds.
Just to clear up some things:
There's no need for a 32-bit encoding to handle more than ~65000 Unicode characters. The most common encodings used for Unicode, UTF-8 and UTF-16, are variable-width encodings, meaning that the number of bytes used to represent a character varies. UTF-16 uses either 2 or 4 bytes per character: characters beyond U+FFFF are stored as a pair of 2-byte units (a surrogate pair). The Mathematica kernel interprets every 2-byte unit as a single character in a string, so a 4-byte sequence shows up as two invalid characters. This may be considered a bug. The front end is quite moody about how it handles 4-byte sequences, which is definitely a bug.
Limited workaround
When working strictly in the kernel (e.g. reading the Unicode data from a file), I sometimes use this function as a workaround to get the actual Unicode code point of 2-unit (4-byte) UTF-16 sequences:
toCodePoint[{a_, b_}] /; 16^^d800 <= a <= 16^^dbff && 16^^dc00 <= b <= 16^^dfff := (a - 16^^d800)*2^10 + (b - 16^^dc00) + 16^4
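For reference, the opposite direction (code point to surrogate pair) can be sketched as below. This is only the arithmetic that UTF-16 defines, under the assumption that the input is a code point above U+FFFF. The name toSurrogatePair is a hypothetical helper, not something built into Mathematica, and producing the pair says nothing about whether the front end will display it.

```mathematica
(* Hypothetical helper: split a code point above U+FFFF into its
   UTF-16 high and low surrogate code units. *)
toSurrogatePair[cp_Integer] /; 16^^10000 <= cp <= 16^^10ffff :=
 With[{v = cp - 16^^10000},
  {16^^d800 + Quotient[v, 2^10],  (* high surrogate: top 10 bits *)
   16^^dc00 + Mod[v, 2^10]}]      (* low surrogate: bottom 10 bits *)

(* The character from the question, U+1F618 *)
toSurrogatePair[16^^1f618]
(* {55357, 56856}, i.e. {16^^d83d, 16^^de18} *)
```

Applying toCodePoint to that pair gives back 16^^1f618 (128536), which is a quick way to check both directions.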
You can use
Split[ToCharacterCode[str], 16^^d800 <= #1 <= 16^^dbff &]
to split a UTF-16 string into Unicode characters correctly (either length-one or length-two, depending on the character).
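Putting the two pieces together, a list of UTF-16 code units can be turned into a list of code points roughly as follows. This is a minimal sketch: utf16ToCodePoints is a hypothetical name, the surrogate formula is inlined so that the block is self-contained, and malformed input (for example a lone high surrogate) is deliberately left as an unconverted sublist rather than handled.

```mathematica
(* Hypothetical helper: decode a list of UTF-16 code units into
   Unicode code points. *)
utf16ToCodePoints[units_List] :=
 Replace[
  (* keep a high surrogate together with the unit that follows it *)
  Split[units, 16^^d800 <= #1 <= 16^^dbff &],
  {
   (* well-formed surrogate pair: combine into one code point *)
   {hi_, lo_} /; 16^^dc00 <= lo <= 16^^dfff :>
    (hi - 16^^d800)*2^10 + (lo - 16^^dc00) + 16^^10000,
   (* ordinary single unit: already a code point *)
   {unit_} :> unit
  },
  {1}]

(* "水" followed by the surrogate pair for U+1F618 *)
utf16ToCodePoints[{16^^6c34, 16^^d83d, 16^^de18}]
(* {27700, 128536} *)
```

In practice the input would be ToCharacterCode[str] for a string read in the kernel, assuming the kernel has stored the 4-byte sequence as two 2-byte units as described above.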
This is an ugly and inconvenient workaround, and it won't allow you to display any of these characters in the front end unless you come up with some hack for that as well, e.g. importing the glyph reference images from unicode.org (at least for CJK they have them).
See also
See my earlier question on the same topic: Reading an UTF-8 encoded text file in Mathematica
If you are going to work with Chinese, you may come across this other problem too: Getting the Mathematica front end to obey the FontFamily option