python regex unicode tweets emoticons

Problems cleaning tweets (emoticons, smileys ...)


I'm having trouble cleaning tweets. I have a process that saves tweets to a CSV file, and I then load the data into a pandas DataFrame.

x is a tweet from my DataFrame:

'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''

More tweets: "b'RT @suzannelynch1: Meanwhile in #Washington... Almost two dozen members of #Congress write to #TheresaMay on eve of #StPatricksDay visit wa\\xe2\\x80\\xa6'

b"RT @KMTV_Kent: #KentTonight Poll:\\nKent\'s MPs will be having their say on Theresa May\'s #Brexit deal today. @SirRogerGaleMP said he\'ll back\\xe2\\x80\\xa6"

The result should look like this: James O'Brien on Geoffrey Cox's awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not for' (keep the hashtags, just delete the non-UTF-8 characters)

I would like to clean this tweet. I tried using regexes with re.sub(my_regex), re.compile ...

The different regexes I tried: [\U00010000-\U0010ffff], @[A-Za-z0-9]+, https?://[A-Za-z0-9./]+
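For reference, patterns like these would typically be applied one after another with re.sub. The snippet below is only a sketch: the helper name strip_patterns and the sample text are made up for illustration, and the mention pattern has been widened with an underscore so that handles such as @KMTV_Kent are matched in full.

```python
import re

# Illustrative patterns, adapted from the ones listed above.
ASTRAL = re.compile('[\U00010000-\U0010ffff]')  # emoji and other characters outside the BMP
MENTION = re.compile(r'@[A-Za-z0-9_]+')         # @user mentions (underscore added here)
URL = re.compile(r'https?://[A-Za-z0-9./]+')    # simple http/https links


def strip_patterns(text):
    """Remove astral characters, @mentions and URLs, then collapse whitespace."""
    for pattern in (ASTRAL, MENTION, URL):
        text = pattern.sub('', text)
    return re.sub(r'\s+', ' ', text).strip()


sample = 'RT @LBC: great news \U0001F600 https://t.co/abc123 #Brexit'
print(strip_patterns(sample))  # RT : great news #Brexit
```

Hashtags are left alone because none of the three patterns matches '#'. These patterns only act on real characters, so they have no effect on escape sequences that are stored as literal text such as \\xe2.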

I also tried this:

x.encode('ascii','ignore').decode('utf-8')  

That doesn't work because of the double backslashes, but it does work when I run:

'to tell us whether or not fore\xe2\x80\xa6'.encode('ascii','ignore').decode('utf-8')

It returns:

'to tell us whether or not fore'
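The difference between the two cases can be checked directly. This is a small sketch with made-up variable names: in the text read from the CSV, each \\x.. is four ordinary ASCII characters (a backslash, x, and two hex digits), so encode('ascii', 'ignore') finds nothing to drop.

```python
# Text as it comes back from the CSV: the backslashes are literal characters.
from_csv = 'or not fore\\xe2\\x80\\xa6'
# Text typed with single backslashes: Python turns each \x.. into one character.
typed = 'or not fore\xe2\x80\xa6'

print(len(from_csv), from_csv.isascii())  # 23 True
print(len(typed), typed.isascii())        # 14 False

cleaned_csv = from_csv.encode('ascii', 'ignore').decode('ascii')
cleaned_typed = typed.encode('ascii', 'ignore').decode('ascii')

print(cleaned_csv == from_csv)  # True: nothing was removed
print(cleaned_typed)            # or not fore
```

str.isascii() needs Python 3.7 or later.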

Does anyone know how to clean it? Many thanks!


Solution

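  • See if this helps:

    import re

    a = 'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''

    # Keep each run of word characters that follows whitespace, a quote or '#';
    # the b prefix, @mentions, colons, backslashes and \x.. escapes are dropped.
    chars = re.findall(r"""[\s"'#]+\w+""", a)

    ''.join(chars)


    Output

    'RT James O'Brien on Geoffrey Cox's awaited legal advice "We are waiting for a single unelected expert to tell us whether or not fore

    The leading 'RT is kept, and punctuation such as the colon after "advice" is lost, so a further pass may be needed.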
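
A different approach is sketched here, under the assumption that each tweet was stored as the printed form of a bytes object (the b'...' prefix suggests this). The idea is to parse that text back into bytes with ast.literal_eval, decode it as UTF-8, and only then drop the non-ASCII characters. The function name clean_tweet is hypothetical.

```python
import ast


def clean_tweet(raw):
    """Turn the printed form of a bytes object (e.g. "b'...'") into plain ASCII text."""
    try:
        value = ast.literal_eval(raw)  # "b'...'" -> bytes
    except (ValueError, SyntaxError):
        return raw                     # not a Python literal: leave it unchanged
    if isinstance(value, bytes):
        text = value.decode('utf-8', errors='ignore')
    else:
        text = str(value)
    # Drop whatever is still outside ASCII (the ellipsis, emoji, ...).
    return text.encode('ascii', 'ignore').decode('ascii')


x = 'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''
print(clean_tweet(x))
```

This keeps hashtags, mentions and punctuation, and also turns escaped sequences such as \\n back into real newlines. With pandas it could be applied to a whole column, for example df['text'].apply(clean_tweet), where 'text' stands in for whatever the tweet column is actually called.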