I'm using the tm package to clean up a Twitter corpus. However, the package fails when the text contains emoticons.
Here's a reproducible example:
July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'
Can someone point me in the right direction to remove the emoticons using the tm package?
Thank you,
Luis
You can use gsub to get rid of all non-ASCII characters.
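# Sample messages, two of which contain non-ASCII symbols
Texts <- c("Let the stormy clouds chase, everyone from the place ☁ ♪ ♬",
           "See you soon brother ☮ ",
           "A boring old-fashioned message")
# Delete every character outside the ASCII range 1-127
gsub("[^\x01-\x7F]", "", Texts)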
[1] "Let the stormy clouds chase, everyone from the place   "
[2] "See you soon brother  "
[3] "A boring old-fashioned message"
Details:
You can specify character classes in regexes with [ ]. When the class description starts with ^, it means everything except these characters. Here, I have specified everything except characters 1-127, i.e. everything except standard ASCII, and I have specified that the matches should be replaced with the empty string.
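To connect this back to the tm workflow in the question, here is a minimal sketch of how the same gsub call might be wrapped and applied to a corpus before tolower. The helper name remove_non_ascii and the sample tweets are made up for illustration, and the tm part only runs if the package is installed. It assumes the text is valid in the session's encoding; badly encoded input may need extra handling that is not shown here.

```r
# Hypothetical helper (the name is ours, not part of tm):
# drop every character outside the ASCII range 1-127.
remove_non_ascii <- function(x) {
  gsub("[^\x01-\x7F]", "", x)
}

# Made-up sample data; \U escapes stand in for emoji.
tweets <- c("Love of country \U0001F1FA\U0001F1F8 July4th", "plain text")
cleaned <- remove_non_ascii(tweets)
print(cleaned)

# Optional: the same helper inside a tm pipeline, run before tolower.
if (requireNamespace("tm", quietly = TRUE)) {
  corpus <- tm::VCorpus(tm::VectorSource(tweets))
  corpus <- tm::tm_map(corpus, tm::content_transformer(remove_non_ascii))
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))
  print(as.character(corpus[[1]]))
}
```

Note that only the non-ASCII characters are deleted, so the spaces that surrounded them remain. A base R alternative that may be worth trying is iconv(x, "UTF-8", "ASCII", sub = ""), which also drops anything that cannot be converted to ASCII.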