I'm having trouble cleaning tweets. I have a process that saves tweets to a CSV file, and I then load the data into a pandas DataFrame.
x is a tweet from my DataFrame:
'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''
More tweets:
"b'RT @suzannelynch1: Meanwhile in #Washington... Almost two dozen members of #Congress write to #TheresaMay on eve of #StPatricksDay visit wa\\xe2\\x80\\xa6'
b"RT @KMTV_Kent: #KentTonight Poll:\\nKent\'s MPs will be having their say on Theresa May\'s #Brexit deal today. @SirRogerGaleMP said he\'ll back\\xe2\\x80\\xa6"
The result should look like this:
James O'Brien on Geoffrey Cox's awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not for'
(Keep the hashtags; just delete the non-UTF-8 characters.)
I would like to clean this tweet. I tried using regexes with re.sub(my_regex), re.compile ...
The different regexes I tried: ([\U00010000-\U0010ffff],r'@[A-Za-z0-9]+',https?://[A-Za-z0-9./]+)
I also tried this:
x.encode('ascii','ignore').decode('utf-8')
It doesn't work because of the double backslashes (the string contains the literal characters \xe2 rather than the actual byte), but it works when I do:
'to tell us whether or not fore\xe2\x80\xa6'.encode('ascii','ignore').decode('utf-8')
It returns:
'to tell us whether or not fore'
Does anyone know how to clean it? Many thanks!
See if this helps:
import re

a = 'b\'RT @LBC: James O\\\'Brien on Geoffrey Cox\\\'s awaited legal advice: "We are waiting for a single unelected expert to tell us whether or not fore\\xe2\\x80\\xa6\''
# Keep each word together with the whitespace, quote, or '#' that precedes it
chars = re.findall(r"""[\s"'#]+\w+""", a)
''.join(chars)
Output:
'RT James O'Brien on Geoffrey Cox's awaited legal advice "We are waiting for a single unelected expert to tell us whether or not fore
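Another option, since each value looks like the printed form of a Python bytes object (b'...'), is to parse it back into real bytes and decode it, instead of matching words with a regex. The sketch below is only one possible approach and makes some assumptions: the function name clean_tweet is made up for illustration, every retweet is assumed to start with "RT @user:", and collapsing whitespace is an extra choice that was not requested in the question.

```python
import ast
import re


def clean_tweet(raw: str) -> str:
    """Turn a stringified bytes literal such as "b'RT @user: ...'" into plain text."""
    try:
        # Parse the text "b'...'" back into a real bytes object
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the input as ordinary text
        value = raw
    if isinstance(value, bytes):
        text = value.decode("utf-8", errors="ignore")
    else:
        text = str(value)
    # Assumption: a retweet starts with "RT @user: "; drop that prefix
    text = re.sub(r"^RT @\w+:\s*", "", text)
    # Drop anything outside ASCII, such as the trailing ellipsis character
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Optional: collapse newlines and repeated spaces into single spaces
    return re.sub(r"\s+", " ", text).strip()


sample = r"b'RT @LBC: whether or not fore\xe2\x80\xa6'"
print(clean_tweet(sample))  # whether or not fore
```

With a DataFrame, the same function could be applied row by row, for example df['text'].apply(clean_tweet), where 'text' is a hypothetical column name. Hashtags are left untouched because only the "RT @user:" prefix and the non-ASCII characters are removed.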