I have a large data frame (~500,000 observations) consisting of Twitter data (i.e. username, retweet counts, text) in RStudio. I want to run a text analysis on the tweets, but I first need to remove retweet tags so they don't affect my keyword searches.
For example, in tweets that are retweets, the text looks like this: RT @BobsAccount Great article! Can't wait to learn more.
I want to remove "RT @" together with the string attached to it (here, "RT @BobsAccount").
I have used lapply and gsub to remove specific characters. For example, this successfully removed "@":
data <- data.frame(lapply(data, function(x) {gsub("@","", x)}))
But I can't figure out how to remove a "string pattern" (i.e. any text attached to "RT @"). Any help would be greatly appreciated!
You may use
data <- data.frame(lapply(data, function(x) {gsub("\\bRT\\s+@\\S*\\s*","", x)}))
The \bRT\s+@\S*\s* pattern matches:
\bRT - a whole word RT
\s+ - 1+ whitespace chars
@ - a @ char
\S* - 0+ non-whitespace chars
\s* - 0+ whitespace chars
See the regex demo.
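To see how the individual parts of the pattern behave, here is a minimal sketch that runs it over a few made-up tweets. The sample strings and the variable names tweets and cleaned are assumptions for illustration, not data from the question.

```r
# Hypothetical sample tweets, invented for illustration only.
tweets <- c(
  "RT @BobsAccount Great article! Can't wait to learn more.",
  "RT @alice: totally agree",    # punctuation after the name is consumed by \S*
  "ART @gallery opens today",    # no word boundary before "RT", so no match
  "No retweet tag here"          # nothing to remove
)

# Same pattern as above: RT at a word boundary, whitespace, "@",
# the rest of the username token, and any trailing whitespace.
cleaned <- gsub("\\bRT\\s+@\\S*\\s*", "", tweets)
print(cleaned)
## [1] "Great article! Can't wait to learn more."
## [2] "totally agree"
## [3] "ART @gallery opens today"
## [4] "No retweet tag here"
```

Note that \S* removes everything up to the next whitespace, so a colon or other punctuation glued to the username is dropped with it.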
R code sample:
text <- c("RT @BobsAccount Great article! Can't wait to learn more.")
data <- data.frame(text)
data <- data.frame(lapply(data, function(x) {gsub("\\bRT\\s+@\\S*\\s*","", x)}))
data
## => text
## 1 Great article! Can't wait to learn more.
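One thing to watch for: gsub returns character vectors, so running it through lapply over every column will turn numeric columns such as the retweet counts into text. If only the tweet text needs cleaning, it may be simpler to apply the pattern to that one column. A minimal sketch, assuming the columns are named username, retweet_count and text (these names are hypothetical, so adjust them to your data):

```r
# Hypothetical data frame; column names and values are assumptions.
data <- data.frame(
  username = c("alice", "carol"),
  retweet_count = c(3L, 0L),
  text = c("RT @BobsAccount Great article!", "Just a normal tweet"),
  stringsAsFactors = FALSE
)

# Clean only the text column; the other columns are left untouched.
data$text <- gsub("\\bRT\\s+@\\S*\\s*", "", data$text)

# retweet_count is still an integer column after the replacement.
str(data)
```

On a data frame with ~500,000 rows this also avoids running the regex over columns that cannot contain a retweet tag.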