r, sentiment-analysis, tm, emoticons

Remove emoticons in R using the tm package


I'm using the tm package to clean up a Twitter corpus. However, the package fails on tweets that contain emoticons.

Here is code that reproduces the problem:

July4th_clean <- tm_map(July4th_clean, content_transformer(tolower))
Error in FUN(content(x), ...) : invalid input 'RT ElleJohnson Love of country is encircling the globes ������������������ july4thweekend July4th FourthOfJuly IndependenceDay NotAvailableOnIn' in 'utf8towcs'

Can someone point me in the right direction to remove the emoticons using the tm package?

Thank you,

Luis


Solution

  • You can use gsub to get rid of all non-ASCII characters.

    Texts = c("Let the stormy clouds chase, everyone from the place ☁  ♪ ♬",
        "See you soon brother ☮ ",
        "A boring old-fashioned message" ) 
    
    gsub("[^\x01-\x7F]", "", Texts)
    [1] "Let the stormy clouds chase, everyone from the place    "
    [2] "See you soon brother  "                                  
    [3] "A boring old-fashioned message"
    

    Details: You can specify character classes in a regex with [ ]. When the class description starts with ^, it matches everything except the listed characters. Here the class excludes the characters with codes 1-127 (\x01-\x7F), i.e. standard ASCII, so every non-ASCII character is matched and replaced with the empty string.
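
To connect this back to the question, here is a minimal sketch of how the same gsub call might be applied to a tm corpus before lowercasing. The helper name remove_non_ascii and the sample tweets are made up for illustration, and the sketch assumes the text is valid UTF-8; the tm part only runs if the package is installed.

```r
# Hypothetical helper: drop every character outside the ASCII range,
# using the same pattern as the gsub call above.
remove_non_ascii <- function(x) gsub("[^\x01-\x7F]", "", x)

# Made-up sample data: one tweet with emoji, one plain ASCII message.
tweets <- c("Love of country \U0001F1FA\U0001F1F8 July4th",
            "A plain message")

# Base R usage on a character vector.
cleaned <- remove_non_ascii(tweets)
print(cleaned)

# The same helper wrapped for tm, guarded so the script still runs without tm.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  corpus <- VCorpus(VectorSource(tweets))
  # Strip non-ASCII characters first, so tolower no longer sees the emoji.
  corpus <- tm_map(corpus, content_transformer(remove_non_ascii))
  corpus <- tm_map(corpus, content_transformer(tolower))
  print(content(corpus[[1]]))
}
```

Running the removal step before tolower is the point of the ordering here: the transformation that raised the error then only receives ASCII text. Note that the pattern also removes accented letters and any other non-ASCII characters, not just emoticons.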