python, unicode, python-2.x, supplementary

supplemental codepoints to unicode string in python


unichr(0x10000) fails with a ValueError when CPython is compiled without --enable-unicode=ucs4.

Is there a language builtin or core library function that converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on?


Solution

  • Yes, here you go:

    >>> unichr(0xd800)+unichr(0xdc00)
    u'\U00010000'
    

    The crucial point to understand is that unichr() converts an integer to a single code unit in the Python interpreter's string encoding. The Python Standard Library documentation for 2.7.3, 2. Built-in Functions, on unichr() reads,

    Return the Unicode string of one character whose Unicode code is the integer i.... The valid range for the argument depends how Python was configured – it may be either UCS2 [0..0xFFFF] or UCS4 [0..0x10FFFF]. ValueError is raised otherwise.

    I added emphasis to "one character", by which they mean "one code unit" in Unicode terms.
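
    A quick way to check which configuration you are running on is sys.maxunicode, which is 0xFFFF on a narrow (UCS-2) build and 0x10FFFF on a wide (UCS-4) build; the output below assumes a narrow build:

    >>> import sys
    >>> hex(sys.maxunicode)
    '0xffff'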

    I'm assuming that you are using Python 2.x. The Python 3.x interpreter has no built-in unichr() function. Instead, the Python Standard Library documentation for 3.3.0, 2. Built-in Functions, on chr() reads,

    Return the string representing a character whose Unicode codepoint is the integer i.... The valid range for the argument is from 0 through 1,114,111 (0x10FFFF in base 16).

    Note that the return value is now a string of unspecified length, not a string with a single code unit. So in Python 3.x, chr(0x10000) would behave as you expected. It "converts an arbitrary unicode scalar value or code-point to a unicode string that works regardless of what kind of python interpreter the program is running on".
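
    For example, on a Python 3.3 interpreter (where PEP 393's flexible string representation removed the narrow/wide build split), the scalar that raised ValueError in the question fits in a single-element string:

    >>> s = chr(0x10000)
    >>> len(s), hex(ord(s))
    (1, '0x10000')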

    But back to Python 2.x. If you use unichr() to create Python 2.x unicode objects, and you are using Unicode scalar values above 0xFFFF, then you are committing your code to being aware of the Python interpreter's implementation of unicode objects.

    You can isolate this awareness with a function which tries unichr() on a scalar value, catches ValueError, and tries again with the corresponding UTF-16 surrogate pair:

    def unichr_supplemental(scalar):
        try:
            # Wide (UCS-4) build, or a BMP scalar: one code unit is enough.
            return unichr(scalar)
        except ValueError:
            # Narrow (UCS-2) build: spell the supplementary scalar as a
            # UTF-16 surrogate pair.
            return unichr(0xd800 + ((scalar - 0x10000) // 0x400)) \
                 + unichr(0xdc00 + ((scalar - 0x10000) % 0x400))
    
    >>> unichr_supplemental(0x41),len(unichr_supplemental(0x41))
    (u'A', 1)
    >>> unichr_supplemental(0x10000), len(unichr_supplemental(0x10000))
    (u'\U00010000', 2)
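
    If you would rather not rely on catching ValueError, an equivalent sketch (unichr_supplemental_checked is just an illustrative name, not a standard function) can branch on sys.maxunicode instead:

    import sys

    def unichr_supplemental_checked(scalar):
        # BMP scalars always fit in one code unit, as does anything on a wide build.
        if scalar <= 0xFFFF or sys.maxunicode > 0xFFFF:
            return unichr(scalar)
        # Otherwise we are on a narrow build: emit the UTF-16 surrogate pair.
        return unichr(0xd800 + ((scalar - 0x10000) // 0x400)) \
             + unichr(0xdc00 + ((scalar - 0x10000) % 0x400))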
    

    But you might find it easier to just convert your scalars to 4-byte UTF-32 values in a UTF-32 byte string, and decode this byte string into a unicode string:

    >>> '\x00\x00\x00\x41'.decode('utf-32be'), \
    ... len('\x00\x00\x00\x41'.decode('utf-32be'))
    (u'A', 1)
    >>> '\x00\x01\x00\x00'.decode('utf-32be'), \
    ... len('\x00\x01\x00\x00'.decode('utf-32be'))
    (u'\U00010000', 2)
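
    To build those 4-byte values from an arbitrary integer scalar rather than a hand-written byte string, struct.pack can do the packing (unichr_utf32 is just an illustrative name):

    >>> import struct
    >>> def unichr_utf32(scalar):
    ...     # Pack the scalar as a 4-byte big-endian integer, then decode it as UTF-32BE.
    ...     return struct.pack('>I', scalar).decode('utf-32be')
    ...
    >>> unichr_utf32(0x41), unichr_utf32(0x10000)
    (u'A', u'\U00010000')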
    

    The code above was tested on Python 2.6.7 with UTF-16 encoding for Unicode strings. I didn't test it on a Python 2.x interpreter with UTF-32 encoding for Unicode strings. However, it should work unchanged on any Python 2.x interpreter with any Unicode string implementation.