python unicode emoji unicode-normalization

How do I process multi-character Unicode emojis in Python 3 with the unicodedata module?


While working with emojis and trying to get their code points and names with the unicodedata module, I kept running into problems with multi-character emojis. The module refuses to accept strings and wants single characters instead. I tried normalizing, I tried encoding in utf-8 and unicode-escape, and I researched it again and again, but I could not figure out what was going on!

emojis = ["💖", "💘", "💝", "💞", "❣️", "✨"]
for emoji in emojis:
    codepoint: str = hex(ord(emoji))
    filename = 'emoji_u{0}.png'.format(codepoint[2:])
    print('{emoji} ({codepoint}) => {filename}'.format(emoji=emoji,
                                                       codepoint=codepoint,
                                                       filename=filename))

Admittedly, the code above does not use the unicodedata module, but it shows the problem I was having regardless:

💖 (0x1f496) => emoji_u1f496.png
💘 (0x1f498) => emoji_u1f498.png
💝 (0x1f49d) => emoji_u1f49d.png
💞 (0x1f49e) => emoji_u1f49e.png
Traceback (most recent call last):
  File "F:/Programming/Languages/Vue.js/lovely/collect.py", line 8, in <module>
    codepoint: str = hex(ord(emoji))
TypeError: ord() expected a character, but string of length 2 found

After a break, somehow, I managed to convert the emoji unintentionally, from this: ❣️ to this: ❣. Python was able to process this new emoji character perfectly fine. The unicodedata module likes it too!

So what's the difference? Why does one have color and not the other in both my browser and IDE? And most importantly, how do I convert multi-character emojis to single-character emojis in Python?


Solution

  • Some emoji that a reader perceives as a single character (a grapheme cluster) are made up of multiple code points. Here's a way to handle them. I added a more complicated example, a family emoji whose members are joined by zero width joiners:

    import unicodedata as ud
    
    emojis = ["💖", "💘", "💝", "💞", "❣️", "✨", "👨‍👩‍👧‍👦"]
    for emoji in emojis:
        print('Emoji:', emoji)
        for cp in emoji:
            print(f'    {cp} U+{ord(cp):04X} {ud.name(cp)}')
    

    Output:

    Emoji: 💖
        💖 U+1F496 SPARKLING HEART
    Emoji: 💘
        💘 U+1F498 HEART WITH ARROW
    Emoji: 💝
        💝 U+1F49D HEART WITH RIBBON
    Emoji: 💞
        💞 U+1F49E REVOLVING HEARTS
    Emoji: ❣️
        ❣ U+2763 HEAVY HEART EXCLAMATION MARK ORNAMENT
        ️ U+FE0F VARIATION SELECTOR-16
    Emoji: ✨
        ✨ U+2728 SPARKLES
    Emoji: 👨‍👩‍👧‍👦
        👨 U+1F468 MAN
        ‍ U+200D ZERO WIDTH JOINER
        👩 U+1F469 WOMAN
        ‍ U+200D ZERO WIDTH JOINER
        👧 U+1F467 GIRL
        ‍ U+200D ZERO WIDTH JOINER
        👦 U+1F466 BOY
    
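The listing above also shows why the two heart-exclamation forms behave differently: the colored one is U+2763 followed by U+FE0F (VARIATION SELECTOR-16), which requests emoji presentation, while the plain one is U+2763 alone. As a minimal sketch, assuming it is acceptable to lose the emoji-presentation hint, the selector can be stripped to get back to a single code point. The helper name strip_emoji_selector is made up for illustration, and it does not collapse ZWJ sequences such as the family emoji, which remain several code points.

```python
import unicodedata as ud

# VARIATION SELECTOR-16 asks the renderer for the colored emoji glyph.
VS16 = "\ufe0f"


def strip_emoji_selector(text: str) -> str:
    """Hypothetical helper: remove every U+FE0F from the string."""
    return text.replace(VS16, "")


heart = "\u2763\ufe0f"              # heart exclamation plus selector, 2 code points
bare = strip_emoji_selector(heart)  # U+2763 only, 1 code point

print(len(heart), len(bare))
print(f"U+{ord(bare):04X} {ud.name(bare)}")
```

After stripping, ord() and unicodedata.name() accept the result because it is a single character again, but whether it still renders in color is up to the font and platform.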

    If the emoji are all in a single string, the rules for splitting it into graphemes are complicated, but they are implemented by the third-party regex module (pip install regex). Its \X pattern matches one grapheme at a time:

    import regex
    
    for m in regex.finditer(r'\X', '💖💘💝💞❣️✨👨‍👩‍👧‍👦'):
        emoji = m.group(0)
        print(f'{emoji}   {ascii(emoji)}')
    

    Output:

    💖   '\U0001f496'
    💘   '\U0001f498'
    💝   '\U0001f49d'
    💞   '\U0001f49e'
    ❣️   '\u2763\ufe0f'
    ✨   '\u2728'
    👨‍👩‍👧‍👦   '\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f466'
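
To come back to the filenames from the question: rather than forcing every emoji into one code point, one option is to build the name from all of its code points. The sketch below assumes an underscore-separated naming scheme, which is only a guess at how the image files are named (some image sets may drop the fe0f part, for example). emoji_filename is a hypothetical helper, not part of any library.

```python
def emoji_filename(emoji: str) -> str:
    """Hypothetical helper: build a PNG name from every code point in the emoji.

    Assumption: code points are written as lowercase hex joined by underscores.
    """
    codepoints = "_".join(f"{ord(ch):x}" for ch in emoji)
    return f"emoji_u{codepoints}.png"


samples = [
    "\U0001f496",    # sparkling heart, one code point
    "\u2763\ufe0f",  # heart exclamation plus VARIATION SELECTOR-16
    "\U0001f468\u200d\U0001f469\u200d\U0001f467\u200d\U0001f466",  # ZWJ family
]
for emoji in samples:
    print(f"{emoji} => {emoji_filename(emoji)}")
```

Because the helper iterates over the string instead of calling ord() on the whole thing, it never raises the TypeError from the question, whatever the length of the sequence.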