I have a Unicode string in a "narrow" build of Python 2.7.10 containing a Unicode character. I'm trying to use that Unicode character as a lookup in a dictionary, but when I index the string to get the last Unicode character, it returns a different string:
>>> s = u'Python is fun \U0001f44d'
>>> s[-1]
u'\udc4d'
Why is this happening, and how do I retrieve '\U0001f44d' from the string?
Edit: unicodedata.unidata_version is 5.2.0 and sys.maxunicode is 65535.
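A Python 2 "narrow" build uses UTF-16 to store Unicode strings (a so-called leaky abstraction), so code points above U+FFFF are stored as two UTF-16 surrogates. Indexing returns individual code units, which is why s[-1] gives only the trailing surrogate. To retrieve the code point, you have to get both the leading and trailing surrogate: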
Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1] # Just the trailing surrogate
u'\udc4d'
>>> s[-2:] # leading and trailing
u'\U0001f44d'
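If you cannot assume the last character is always a surrogate pair, slicing with a fixed [-2:] is fragile. One possible approach, sketched below, is to split the string into code points with a regular expression that keeps a leading and trailing surrogate together. The names `_CODE_POINT` and `code_points` are made up for this illustration; they are not part of any library.

```python
import re

# Hypothetical helper, not a standard API: match a leading surrogate followed
# by a trailing surrogate as one unit, otherwise match any single code unit.
_CODE_POINT = re.compile(u'[\ud800-\udbff][\udc00-\udfff]|.', re.DOTALL)


def code_points(s):
    """Return a list of code points, keeping surrogate pairs together."""
    return _CODE_POINT.findall(s)


s = u'Python is fun \U0001f44d'
last = code_points(s)[-1]  # the whole code point, suitable as a dictionary key
```

On a narrow build `last` is the two-unit string that compares equal to u'\U0001f44d', so it can be used directly as the dictionary key. The same code also runs on wide builds and on Python 3, where the second alternative of the pattern simply matches the whole code point.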
Alternatively, switch to Python 3.3+, where the problem has been solved: the storage details of code points in a Unicode string are no longer exposed, and indexing always returns a whole code point:
Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> s = u'Python is fun \U0001f44d'
>>> s[-1] # code points are stored in Unicode strings.
'\U0001f44d'
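As a footnote to the narrow-build behavior, here is a small sketch of the standard UTF-16 arithmetic that produces the two surrogates, which shows where the u'\udc4d' in the question comes from. The function name `to_surrogates` is invented for this example.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    offset = code_point - 0x10000       # 20 bits remain after removing the base
    lead = 0xD800 + (offset >> 10)      # top 10 bits go in the leading surrogate
    trail = 0xDC00 + (offset & 0x3FF)   # low 10 bits go in the trailing surrogate
    return lead, trail


lead, trail = to_surrogates(0x1F44D)
# lead == 0xD83D and trail == 0xDC4D; the trail value is the u'\udc4d'
# that s[-1] returned on the narrow build.
```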