
Removing string pattern from dataframe (Twitter data in RStudio)


I have a large dataframe (~500,000 observations) consisting of Twitter data (i.e. username, retweet counts, text) in RStudio. I want to run a text analysis on the tweets, but I first need to remove retweet tags so they don't affect my keyword searches.

For example, in tweets that are retweets, the text looks like this: RT @BobsAccount Great article! Can't wait to learn more. I want to remove the string attached to RT @.....

I have used lapply and gsub to remove specific characters. For example, this successfully removed "@" : data <- data.frame(lapply(data, function(x) {gsub("@","", x)}))

But I can't figure out how to remove a "string pattern" (i.e. any text attached to "RT @"). Any help would be greatly appreciated!


Solution

  • You may use

    data <- data.frame(lapply(data, function(x) {gsub("\\bRT\\s+@\\S*\\s*","", x)}))
    

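    The \bRT\s+@\S*\s* pattern matches

      • \b - a word boundary, so RT is only matched as a whole word
      • RT - the literal text RT
      • \s+ - one or more whitespace characters
      • @ - a literal @
      • \S* - zero or more non-whitespace characters (the account name plus any punctuation attached to it, such as a colon)
      • \s* - zero or more whitespace characters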

    R code sample:

    text <- c("RT @BobsAccount Great article! Can't wait to learn more.")
    data <- data.frame(text)
    data <- data.frame(lapply(data, function(x) {gsub("\\bRT\\s+@\\S*\\s*","", x)}))
    data
    ## =>                                       text
    ##     1 Great article! Can't wait to learn more.
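
One thing to keep in mind: gsub always returns character vectors, so running it over every column with lapply turns numeric columns such as the retweet counts into text. A minimal sketch of applying the same pattern to the text column only, assuming hypothetical column names (username, retweet_count, text) and made-up sample tweets that are not from the original question:

```r
# Hypothetical sample data; the column names and tweets are assumptions for illustration
tweets <- data.frame(
  username = c("alice", "bob", "carol"),
  retweet_count = c(3L, 0L, 12L),
  text = c(
    "RT @BobsAccount Great article! Can't wait to learn more.",
    "START @nobody is not a retweet tag",
    "Agreed. RT @Some_User: worth reading"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer above
rt_pattern <- "\\bRT\\s+@\\S*\\s*"

# Only the text column is rewritten, so retweet_count keeps its integer type
tweets$text <- gsub(rt_pattern, "", tweets$text)

print(tweets$text)
print(class(tweets$retweet_count))
```

The second sample tweet is left untouched because \b prevents a match on the "RT" inside "START", and the third shows that a retweet tag in the middle of a tweet is removed as well, including the trailing colon.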