python python-2.7 unicode ucs2 surrogate-pairs

Python unicode indexing shows different character


I have a Unicode string in a "narrow" build of Python 2.7.10 containing a Unicode character. I'm trying to use that Unicode character as a lookup in a dictionary, but when I index the string to get the last Unicode character, it returns a different string:

>>> s = u'Python is fun \U0001f44d'
>>> s[-1]
u'\udc4d'

Why is this happening, and how do I retrieve '\U0001f44d' from the string?

Edit: unicodedata.unidata_version is 5.2.0 and sys.maxunicode is 65535.



Solution

  • A Python 2 "narrow" build uses UTF-16 to store Unicode strings (a so-called leaky abstraction), so code points above U+FFFF are stored as two UTF-16 surrogates, and indexing returns one surrogate at a time. To retrieve the code point, you have to get both the leading and trailing surrogate:

    Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:25:58) [MSC v.1500 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> s = u'Python is fun \U0001f44d'
    >>> s[-1]     # Just the trailing surrogate
    u'\udc4d'
    >>> s[-2:]    # leading and trailing
    u'\U0001f44d'
    

  • Switch to Python 3.3+, where the problem has been solved and the storage details of Unicode code points in a Unicode string are not exposed:

    Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:57:17) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> s = u'Python is fun \U0001f44d'
    >>> s[-1]   # Indexing returns the whole code point, not a surrogate.
    '\U0001f44d'
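
If you have to stay on a narrow build, the pairing described above can be done by hand. The following is a minimal sketch, not part of the standard library: `decode_surrogate_pair` and `iter_code_points` are hypothetical helper names chosen for illustration. It applies the standard UTF-16 formula to combine a leading surrogate (U+D800..U+DBFF) and a trailing surrogate (U+DC00..U+DFFF) into one code point, and it should behave the same on narrow and wide builds.

```python
def decode_surrogate_pair(lead, trail):
    """Combine a UTF-16 leading and trailing surrogate into one code point."""
    hi, lo = ord(lead), ord(trail)
    if not (0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate carries 10 bits of the code point's offset from U+10000.
    return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)


def iter_code_points(s):
    """Yield integer code points, joining surrogate pairs where present."""
    i = 0
    while i < len(s):
        ch = s[i]
        is_lead = 0xD800 <= ord(ch) <= 0xDBFF
        has_trail = i + 1 < len(s) and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF
        if is_lead and has_trail:
            yield decode_surrogate_pair(ch, s[i + 1])
            i += 2  # consumed both halves of the pair
        else:
            yield ord(ch)  # BMP character (or an unpaired surrogate)
            i += 1


s = u'Python is fun \U0001f44d'
last = list(iter_code_points(s))[-1]
print(hex(last))
```

For the dictionary lookup in the question, the simplest key on a narrow build is probably still the two-unit slice `s[-2:]`; the helper is mainly useful when you do not know in advance whether the last character is a surrogate pair.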