
How to convert utf-8 encoding to a string?


I was trying to preprocess some tweet text. The text is in a CSV file that was scraped with tweepy. I am using Jupyter Notebook; suppose the text is stored in a variable p. When I display it as cell output, it looks like this:

"b'@sarahbea34343 \\xf0\\x9f\\x98\\x94 I\\xe2\\x80\\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf'"

If I instead call print(p) in Jupyter, the output is:

"b'@sarahbea34343 \xf0\x9f\x98\x94 I\xe2\x80\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf'"

I searched online and it seemed that this is a bytes object holding UTF-8 encoded text, so I tried to decode it with .decode('utf-8'), which raised an error. The problem, as I found out, is that because the data was stored in a CSV file, the bytes representation was written out as text, so the whole tweet is a str. That means even the backslashes are literal characters in the string. How can I convert it so that I can remove these emojis and the escaped UTF-8 sequences of other characters?

I have tried several things, all of which returned the same string, such as:

p.encode('ascii','ignore').decode('ascii')

or p.encode('latin-1').decode('utf-8').encode('ascii', 'ignore')


Solution

  • If the text really has been stored like this (so you are reading the file in text mode 'r') you can do this:

    # The raw-string prefix r keeps the backslashes literal, as they are in the file.
    # Slice off the leading b' and the trailing '.
    s = r"b'@sarahbea34343 \xf0\x9f\x98\x94 I\xe2\x80\x99m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf'"[2:-1]
    
    # Encode as latin-1 to get bytes, decode with unicode-escape to unescape
    # the byte expressions (\\xhh -> \xhh), encode as latin-1 again to get
    # the original bytes back, then finally decode as UTF-8.
    
    new_s = s.encode('latin-1').decode('unicode-escape').encode('latin-1').decode('utf-8')
    print(new_s)
    @sarahbea34343 😔 I’m not going in overly optimistic tbh but hey... https://twitter.com/icxdsfdf
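
A possible alternative, sketched here under the assumption that each CSV cell holds exactly the text of a Python bytes literal (as p does in the question): ast.literal_eval can parse that text back into a real bytes object, which can then be decoded as UTF-8. The helper names decode_bytes_repr and strip_non_ascii are made up for illustration, and the sample tweet is shortened. The second helper applies the encode('ascii', 'ignore') idea from the question to the decoded text, where it actually removes the emoji.

```python
import ast


def decode_bytes_repr(text: str) -> str:
    """Convert the text of a bytes literal, e.g. "b'caf\\xc3\\xa9'", into a str.

    Hypothetical helper: assumes `text` is a complete bytes literal
    whose content is valid UTF-8.
    """
    raw = ast.literal_eval(text)  # parses the literal without running arbitrary code
    if not isinstance(raw, bytes):
        raise ValueError("expected the text of a bytes literal")
    return raw.decode("utf-8")


def strip_non_ascii(text: str) -> str:
    """Drop every non-ASCII character (emoji, curly quotes, and so on)."""
    return text.encode("ascii", "ignore").decode("ascii")


# Shortened sample; the raw string keeps the backslashes literal, as in the CSV.
p = r"b'@sarahbea34343 \xf0\x9f\x98\x94 I\xe2\x80\x99m not going'"

decoded = decode_bytes_repr(p)
ascii_only = strip_non_ascii(decoded)
print(decoded)
print(ascii_only)
```

Note that dropping non-ASCII characters also removes the curly apostrophe in "I’m", so replacing such punctuation with an ASCII equivalent before stripping may be preferable, depending on the preprocessing goal.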