
How to properly print a list of Unicode characters in Python?


I am trying to search for emoticons in python strings. So I have, for example,

em_test = ['\U0001f680']
print(em_test)
['🚀']
test = 'This is a test string 💰💰🚀'
if any(x in test for x in em_test):
    print("yes, the emoticon is there")
else:
    print("no, the emoticon is not there")

yes, the emoticon is there

and if I search for em_test in

'This is a test string 💰💰🚀'

I can actually find it.

So I have made a CSV file with all the emoticons I want, each defined by its Unicode escape sequence. The CSV looks like this:

\U0001F600

\U0001F601

\U0001F602

\U0001F923

and when I import it and print it I actually do not get the emoticons but rather just the text representation:

['\\U0001F600',
 '\\U0001F601',
 '\\U0001F602',
 '\\U0001F923',
...
]

and hence I cannot use this to search for these emoticons in another string. I know that the double backslash \\ is only the representation of a single backslash, but somehow the reader does not interpret the Unicode escape. I do not know what I'm missing.

Any suggestions?


Solution

  • You can decode those Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if those sequences are text rather than bytes you first need to encode them into bytes. Alternatively, you can (probably) open your CSV file in binary mode in order to read those sequences as bytes rather than as text strings.

    Just for fun, I'll also use unicodedata to get the names of those emojis.

    import unicodedata as ud
    
    emojis = [
        '\\U0001F600',
        '\\U0001F601',
        '\\U0001F602',
        '\\U0001F923',
    ]
    
    for u in emojis:
        s = u.encode('ASCII').decode('unicode-escape')
        print(u, ud.name(s), s)
    

    output

    \U0001F600 GRINNING FACE 😀
    \U0001F601 GRINNING FACE WITH SMILING EYES 😁
    \U0001F602 FACE WITH TEARS OF JOY 😂
    \U0001F923 ROLLING ON THE FLOOR LAUGHING 🤣
    

    This should be much faster than using ast.literal_eval. And if you read the data in binary mode it will be even faster since it avoids the initial decoding step while reading the file, as well as allowing you to eliminate the .encode('ASCII') call.

    You can make the decoding a little more robust by using

    u.encode('Latin1').decode('unicode-escape')
    

    but that shouldn't be necessary for your emoji data. And as I said earlier, it would be even better if you open the file in binary mode to avoid the need to encode it.
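
The binary-mode approach mentioned above could look roughly like the sketch below. It assumes the file holds exactly one escape sequence per line and nothing else. The helper name load_emojis is hypothetical, and io.BytesIO stands in for a real file opened with mode 'rb' so that the example is self-contained.

```python
import io

def load_emojis(binary_file):
    """Decode one Unicode escape sequence per line from a file opened in binary mode."""
    emojis = []
    for raw in binary_file:
        raw = raw.strip()  # drop the newline and any surrounding whitespace
        if raw:            # skip blank lines
            # raw is already bytes, so no .encode() step is needed
            emojis.append(raw.decode('unicode-escape'))
    return emojis

# Stand-in for a real file opened in binary mode (assumption: one sequence per line).
sample = io.BytesIO(b"\\U0001F600\n\\U0001F601\n\n\\U0001F602\n\\U0001F923\n")
emojis = load_emojis(sample)

# The decoded list can now be used for the search from the question.
test = 'This is a test string \U0001F4B0\U0001F4B0\U0001F602'
found = [e for e in emojis if e in test]
print(found)
```

If the real CSV has several columns or quoted fields, this line-by-line loop will not be enough; in that case it is probably simpler to read the file as text with the csv module and apply .encode('ASCII').decode('unicode-escape') to the relevant field.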