python unicode formatting emoji

Converting emojis to Unicode and vice versa in Python 3


I am trying to convert an emoji into its Unicode code point in Python 3. For example, given the emoji 😀, I would like to get the corresponding code point 'U+1F600'. Similarly, I would like to convert 'U+1F600' back to 😀. I have read the documentation and tried several options, but Python's behaviour confuses me here.

>>> x = '😀'
>>> y = x.encode('utf-8')
>>> y
b'\xf0\x9f\x98\x80'

The emoji is converted to a bytes object.

>>> z = y.decode('utf-8')
>>> z
'😀'

This converts the bytes object back to the emoji. So far so good.

Now, taking the Unicode escape for the emoji:

>>> c = '\U0001F600'
>>> d = c.encode('utf-8')
>>> d
b'\xf0\x9f\x98\x80'

This prints out the byte encoding again.

>>> d.decode('utf-8')
'😀'

This prints the emoji out again. I really can't figure out how to convert directly between the 'U+1F600' notation and the emoji.


Solution

  • '😀' is already a Unicode string (a str in Python 3). UTF-8 is not Unicode; it is a byte encoding for Unicode. To get the code point number of a single Unicode character, you can use the built-in ord function. To print it in the form you want, format that number as uppercase hex, like this:

    s = '😀'
    print('U+{:X}'.format(ord(s)))
    

    output

    U+1F600
    

    If you have Python 3.6+, you can make it even shorter (and more efficient) by using an f-string:

    s = '😀'
    print(f'U+{ord(s):X}')
    

    BTW, if you want to create a Unicode escape sequence like '\U0001F600', there's the 'unicode-escape' codec. However, encoding with it returns a bytes object (with lowercase hex digits), and you may wish to convert that back to text. You could use the 'UTF-8' codec for that, but you might as well just use the 'ASCII' codec, since the escaped output is guaranteed to contain only valid ASCII.

    s = '😀'
    print(s.encode('unicode-escape'))
    print(s.encode('unicode-escape').decode('ASCII'))
    

    output

    b'\\U0001f600'
    \U0001f600
    

    I suggest you take a look at this short article by Stack Overflow co-founder Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
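The question also asks for the reverse direction ('U+1F600' back to 😀), which the snippets above do not cover. Here is a minimal sketch of a round trip, assuming the input always uses the 'U+' prefix followed by hex digits. The helper names emoji_to_codepoints and codepoints_to_emoji are made up for illustration and are not part of any library. Because ord only accepts a single character, the sketch loops over the string so that emoji made of several code points (for example a thumbs-up with a skin-tone modifier) also work.

```python
def emoji_to_codepoints(text):
    """Return 'U+XXXX' notation for every code point in text, space-separated."""
    # ord() only accepts one character, so handle each code point in turn.
    return ' '.join(f'U+{ord(ch):04X}' for ch in text)


def codepoints_to_emoji(notation):
    """Turn 'U+1F600' (or several, space-separated) back into a string."""
    # Assumption: every token starts with the two-character prefix 'U+'.
    return ''.join(chr(int(token[2:], 16)) for token in notation.split())


print(emoji_to_codepoints('😀'))                     # U+1F600
print(codepoints_to_emoji('U+1F600') == '😀')        # True

# An emoji built from two code points: thumbs up + skin-tone modifier.
print(emoji_to_codepoints('\U0001F44D\U0001F3FD'))   # U+1F44D U+1F3FD

# The 'unicode-escape' codec can also be run in reverse on escaped text.
escaped = '\\U0001f600'
print(escaped.encode('ascii').decode('unicode-escape') == '😀')  # True
```

chr is the inverse of ord, and int(..., 16) parses the hex digits, so the two helpers undo each other for any string of valid code points.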