
Streamlining the cleaning of tweet text with stringr


I am learning about text mining and rtweet, and I am looking for the easiest way to clean text obtained from tweets. I have been using the method recommended in this link to remove URLs, strip anything other than English letters or spaces, and remove stopwords, extra whitespace, numbers, and punctuation.

This method uses both gsub() and tm_map(), and I was wondering whether the cleaning process could be streamlined by using stringr and simply adding each step to a cleaning pipeline. I saw an answer on this site that recommended the following function, but for some reason I am unable to run it.

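library(dplyr)    # provides %>%
library(stringr)

clean_tweets <- function(x) {
    x %>%
        # remove URLs (greedy: also drops any text that follows on the same line)
        str_remove_all(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)") %>%
        str_replace_all("&amp;", "and") %>%
        str_remove_all("^RT:? ") %>%
        # strip mentions and hashtags before punctuation, while @ and # still exist
        str_remove_all("@[[:alnum:]_]+") %>%
        str_remove_all("#[[:alnum:]_]+") %>%
        str_remove_all("[[:punct:]]") %>%
        str_replace_all("\\n", " ") %>%
        str_to_lower() %>%
        str_trim("both")
}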
    

Clean Solution:

tweetsClean <- df %>% 
  mutate(clean = clean_tweets(text))

Lastly, is it possible to retain the emojis in order to count how often each one is used and potentially create custom sentiments for each?

Emoji Solution:

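library(dplyr)
library(tidyr)    # unnest()
library(emo)      # ji_extract_all()

TopEmoji <- tweetsClean %>%
    mutate(emoji = ji_extract_all(text)) %>%   # list-column of emojis per tweet
    unnest(cols = c(emoji)) %>%                # one row per emoji occurrence
    count(emoji, sort = TRUE) %>%
    top_n(5, n)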

Once the text values are clean, my process is to select the relevant columns, add a line number to track which tweet each word belongs to, and unnest the tokens:

library(tidytext)  # unnest_tokens(), stop_words, get_sentiments()

tweetsClean <- tweets %>%
    select(created_at, text) %>%
    mutate(linenumber = row_number()) %>%
    select(linenumber, everything()) %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words, by = "word")

Following that, I attach the desired sentiment lexicons and give each line a value equal to the sum of its AFINN scores:

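sentiment_bing <- get_sentiments("bing")
sentiment_AFINN <- get_sentiments("afinn")

# Chaining both inner joins keeps only words that appear in both lexicons
tweetsValue <- tweetsClean %>%
    inner_join(sentiment_bing, by = "word") %>%
    inner_join(sentiment_AFINN, by = "word") %>%
    group_by(linenumber, created_at) %>%
    mutate(TweetValue = sum(value))   # sum of AFINN scores, repeated on each word row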

Thanks for the pointers!

TestData:

df <- structure(list(created_at = structure(c(1622854597, 1622853904, 
1622853716, 1622778852, 1622448379, 1622450951, 1622777623, 1622853561, 
1622466544, 1622853192), tzone = "UTC", class = c("POSIXct", 
"POSIXt")), text = c("@elonmusk can the dogefather ride @CumRocketCrypto into the night. #SpaceX @dogecoin https://twitter.com/", 
"@CryptoCrunchApp @CumRocketCrypto @vergecurrency @InuSanshu @Mettalex @UniLend_Finance @NuCypher @Chiliz @JulSwap @CurveFinance @PolyDoge Wrong this twitt shansu", 
"9am AEST Sunday morning!!!\nI will be hosting on the @CumRocketCrypto twitch channel!\n\nSo cum say Hi! https://twitter.com/", 
"@SamInCrypt1 @IamMars34147875 @DylanMcKitten @elonmusk @CumRocketCrypto Cumrocket <U+0001F4A6> https://twitter.com/", 
"@DK19663019 @CumRocketCrypto Oh hey, that's me! Did you grab one?", 
"@DK19663019 @CumRocketCrypto Thank you! <U+2764><U+FE0F>", "@CumRocketInfo @elonmusk @CumRocketCrypto Maybe he'd like to meet the CUMrocket models? https://twitter.com/", 
"@AerotyneToken @CumRocketCrypto Is there a way to make sure ones wallet ID is on the list?", 
"@AerotyneToken @CumRocketCrypto Does one have to attend the giveaway stream, or just hold 0.2 BNB of #CUMMIES and #ATYNE?\nAnd what happens if I bought about 0.2BNB each and the BNB price rises? Do I have to check every day if they're still worth at least 0.2?", 
"@Don_Santino1 @brandank_cr @PAWGcoinbsc @Tyga @CumRocketCrypto Massive bull flag. 10x is imminent!"
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))

Solution

  • cleaning issue

    To answer your primary question: clean_tweets() presumably fails on the line "Clean <- tweets %>% clean_tweets" because you are passing it a data frame, whereas the function's internals (i.e., the str_ functions) expect character vectors (strings).

    I say "presumably" because I don't know what your tweets object looks like, so I can't be sure. On your test data, at least, the following solves the problem:

    df %>% 
      mutate(clean = clean_tweets(text))
    

    If you just want the character vector back, you could also do:

    clean_tweets(df$text)
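
To make the data-frame-versus-vector distinction concrete, here is a minimal sketch. The name clean_text_checked, its shortened set of cleaning steps, and the two sample tweets are all hypothetical and are not part of your original function. It only illustrates one way a cleaner could fail early with a readable message when it receives something other than a character vector, and that it behaves as expected inside mutate().

```r
library(dplyr)
library(stringr)

# Hypothetical helper (an assumption for illustration, not the original
# clean_tweets): stop with a clear message unless x is a character vector.
clean_text_checked <- function(x) {
  if (!is.character(x)) {
    stop("expected a character vector such as df$text, got: ", class(x)[1])
  }
  x %>%
    str_remove_all("https?://\\S+") %>%    # drop URLs
    str_remove_all("@[[:alnum:]_]+") %>%   # drop @mentions
    str_remove_all("[[:punct:]]") %>%      # drop remaining punctuation
    str_squish() %>%                       # collapse and trim whitespace
    str_to_lower()
}

# Made-up sample data, not taken from the test data in the question
sample_df <- tibble(
  text = c("@user Hello World! https://example.com/x", "Second   tweet")
)

# Works: mutate() hands the text column to the function as a character vector
cleaned <- sample_df %>%
  mutate(clean = clean_text_checked(text))

# Fails with the custom message: the whole data frame is passed in
err_msg <- tryCatch(
  clean_text_checked(sample_df),
  error = function(e) conditionMessage(e)
)
```

Because the str_ functions are vectorised, the helper returns one cleaned string per input element, which is what mutate() needs to fill a new column.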
    

    emoji issue

    Regarding the possibility of retaining emojis and assigning them sentiments, yes, I think you would proceed in essentially the way you have with the rest of the text: tokenize them, assign numeric values to each one, then aggregate.
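
As a rough sketch of that tokenize, score, aggregate idea: the code below assumes the emojis have already been extracted into one row per occurrence (for example with the ji_extract_all() and unnest() steps from the question). The emoji_tokens and emoji_lexicon tables, the EmojiValue column, and the scores themselves are hypothetical placeholders, not real sentiment data.

```r
library(dplyr)

# Hypothetical input: one row per emoji occurrence, keyed by tweet line number
emoji_tokens <- tibble(
  linenumber = c(4L, 6L, 6L),
  emoji = c("\U0001F4A6", "\u2764\uFE0F", "\U0001F4A6")
)

# Hypothetical custom lexicon; the values are placeholders you would choose
emoji_lexicon <- tibble(
  emoji = c("\U0001F4A6", "\u2764\uFE0F"),
  value = c(1, 3)
)

# Attach a score to each emoji, then sum the scores within each tweet
emoji_scores <- emoji_tokens %>%
  inner_join(emoji_lexicon, by = "emoji") %>%
  group_by(linenumber) %>%
  summarise(EmojiValue = sum(value), .groups = "drop")
```

The result has one row per tweet, so it could be joined back to the word-level scores by linenumber if you want a combined value.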