Tags: r, stringr, rtweet

Filter Dataframe from Twitter API to exclude non-English text in R


I have a data frame of tweets from the Twitter API that contains both English and non-English text. Before posting this question, I searched Stack Overflow and did not find anything that addresses what I am trying to do.

Since tweets often contain emojis, I want to filter out tweets that are not in English while ignoring the emojis. I have tried stringi::stri_enc_isascii(), but it does not recognize English tweets containing emojis as English.

For replication purposes, here are some texts:

"私は、トランプ大統領を信じています🇺🇸🇯🇵 #America"
"Thank you Nashville"
"🇺🇸 Bless America"

In the final corpus, I should only have the last two texts.

Thank you!


Solution

  • You can remove all non-ASCII characters from your dataset by doing:

    # assuming "tweets" is the column holding the tweet text;
    # gsub() is already vectorized, so no sapply() is needed
    dataset$tweets <- gsub("[^\x01-\x7F]", "", dataset$tweets)
    

    All emojis and other non-ASCII characters will then be removed. The next step is to keep only the rows where the tweets field is not empty.

    dataset <- dataset[dataset$tweets != "", ]
    

    Now, if you want to keep the emojis, a better solution is to perform this replacement only for indexing purposes and then use the resulting index to filter the untouched data. For example:

    modified_tweets <- gsub("[^\x01-\x7F]", "", dataset$tweets)
    
    # now filter by the condition (note the comma, which keeps all columns)
    dataset <- dataset[modified_tweets != "", ]
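    As a quick end-to-end check on the sample texts from the question, here is a minimal sketch of the whole approach. Note one caveat: the hashtag-stripping step below is an addition, because an otherwise non-English tweet can still carry an ASCII hashtag (like "#America" in the Japanese example) and would survive a plain ASCII filter.

    ```r
    # Sample tweets from the question
    dataset <- data.frame(
      tweets = c(
        "私は、トランプ大統領を信じています🇺🇸🇯🇵 #America",
        "Thank you Nashville",
        "🇺🇸 Bless America"
      ),
      stringsAsFactors = FALSE
    )

    # Strip non-ASCII characters for indexing purposes only
    modified_tweets <- gsub("[^\x01-\x7F]", "", dataset$tweets)

    # Also drop hashtags and surrounding whitespace, so an ASCII hashtag
    # alone does not make a non-English tweet look English
    modified_tweets <- trimws(gsub("#\\S+", "", modified_tweets))

    # Keep the untouched rows where English (ASCII) text remains
    dataset <- dataset[modified_tweets != "", , drop = FALSE]
    dataset$tweets
    # "Thank you Nashville" and "🇺🇸 Bless America"
    ```

    This keeps the emojis in the retained tweets, since the ASCII stripping was applied only to the throwaway indexing copy.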