I’ve got a df that consists of Twitter handles that I wish to scrape on a regular basis.
df=data.frame(twitter_handles=c("@katyperry","@justinbieber","@Cristiano","@BarackObama"))
My Methodology
I would like to run a for loop that loops over each of the handles in my df and creates multiple dataframes:
1) By using the rtweet library, I would like to gather tweets using the search_tweets function.
2) Then I would like to merge the new tweets with the existing tweets for each dataframe, and then use the unique function to remove any duplicate tweets.
3) For each dataframe, I'd like to add a column with the name of the Twitter handle used to obtain the data. For example: for the dataframe of tweets obtained using the handle @BarackObama, I'd like an additional column called Source with the handle @BarackObama.
4) In the event that the API returns 0 tweets, I would like Step 2) to be ignored. Very often, when the API returns 0 tweets, I get an error as it attempts to merge an empty dataframe with an existing one.
5) Finally, I would like to save the results of each scrape to a separate dataframe object. The name of each dataframe object would be its Twitter handle, in lower case and without the @.
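The naming rule in step 5 can be sketched with base R string functions. This is only an illustration: the helper name handle_to_name is hypothetical and not part of rtweet.

```r
# Hypothetical helper (not part of rtweet): turn a Twitter handle into the
# object name described in step 5 (lower case, no leading "@").
handle_to_name <- function(handle) {
  tolower(sub("^@", "", handle))
}

handles <- c("@katyperry", "@justinbieber", "@Cristiano", "@BarackObama")
df_names <- handle_to_name(handles)
print(df_names)
# "katyperry" "justinbieber" "cristiano" "barackobama"
```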
My Desired Output
My desired output would be 4 dataframes: katyperry, justinbieber, cristiano & barackobama.
My Attempt
library(rtweet)
library(ROAuth)
#Accessing Twitter API using my Twitter credentials
key <-"yKxxxxxxxxxxxxxxxxxxxxxxx"
secret <-"78EUxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
setup_twitter_oauth(key,secret)
#Dataframe of Twitter handles
df=data.frame(twitter_handles=c("@katyperry","@justinbieber","@Cristiano","@BarackObama"))
# Setting up the query
query <- as.character(df$twitter_handles)
query <- unlist(strsplit(query,","))
tweets.dataframe = list()
# Loop through the twitter handles & store the results as individual dataframes
for(i in 1:length(query)){
  result<-search_tweets(query[i],n=10000,include_rts = FALSE)
  #Strip tweets that contain RTs
  tweets.dataframe <- c(tweets.dataframe,result)
  tweets.dataframe <- unique(tweets.dataframe)
}
However I have not been able to figure out how to include in my for loop the part which ignores the concatenation step if the API returns 0 tweets for a given handle.
Also, my for loop does not create 4 dataframes in my environment; instead it stores the results as a single Large list.
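The large list is probably a consequence of how c() treats data frames: a data frame is itself a list of columns, so c(some_list, a_data_frame) appends each column as a separate list element. A small sketch with made-up toy data (no API call) shows the difference from wrapping the data frame in list() first:

```r
# Toy stand-in for one search result (made-up data, no API call)
result <- data.frame(id = 1:2, text = c("a", "b"), stringsAsFactors = FALSE)

flattened <- c(list(), result)        # columns become separate list elements
wrapped   <- c(list(), list(result))  # the data frame stays in one piece

length(flattened)            # 2: one element per column
length(wrapped)              # 1: one element holding the whole data frame
is.data.frame(wrapped[[1]])  # TRUE
```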
I identified a post that addresses a problem very similar to the one I face, but I find it difficult to adapt to my question.
Your inputs would be greatly appreciated.
Edit: I have added Step 3) in My Methodology, in case you are able to help with that too.
tweets.dataframe = list()
# Loop through the twitter handles & store the results as individual dataframes
for(i in seq_along(query)){
  result<-search_tweets(query[i],n=10,include_rts = FALSE)
  if (nrow(result) > 0) { # only if result has data
    # wrap in list() so the data frame is appended as a single element
    tweets.dataframe <- c(tweets.dataframe, list(result))
  }
}
# tweets.dataframe is now a list where each element is a data frame containing
# the results from an individual query; for example...
tweets.dataframe[[1]]
# to combine them into one data frame
do.call(rbind, tweets.dataframe)
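If separate objects are preferred over one combined data frame, one option is to name the list elements and copy them into an environment with base R's list2env(). The sketch below uses made-up toy data in place of real search results, and the element names are assumptions chosen to match the desired output:

```r
# Toy stand-ins for per-handle results (made-up data, no API call)
tweets_list <- list(
  katyperry   = data.frame(id = 1:2, text = c("a", "b")),
  barackobama = data.frame(id = 3L, text = "c")
)

# Copy each named element into its own object in a target environment.
# A fresh environment is used here; globalenv() would place the objects
# in the workspace instead.
target <- new.env()
list2env(tweets_list, envir = target)

ls(target)
nrow(get("katyperry", envir = target))
```

Keeping the results in a named list and only exporting them at the end avoids repeated assign()/get() calls inside the loop.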
In response to a reply, the version below creates one dataframe per handle:
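twitter_handles <- c("@katyperry","@justinbieber","@Cristiano","@BarackObama")
# Loop through the twitter handles & store the results as individual dataframes
for (handle in twitter_handles) {
  result <- search_tweets(handle, n = 15, include_rts = FALSE)
  # Step 4: skip handles for which the API returned no tweets
  if (nrow(result) == 0) next
  # Step 3: record which handle produced these rows
  result$Source <- handle
  # Step 5: object name is the handle in lower case, without the leading "@"
  df_name <- tolower(substring(handle, 2))
  if (exists(df_name)) {
    # Step 2: append to the existing tweets and drop duplicates
    assign(df_name, unique(rbind(get(df_name), result)))
  } else {
    assign(df_name, result)
  }
}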
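Steps 2 and 4 of the question (merge, de-duplicate, and skip empty results) can also be isolated in a small function so they can be checked without calling the API. Here merge_tweets is a hypothetical helper, and the toy data frames stand in for real search_tweets output:

```r
# Hypothetical helper: append new rows to existing ones and drop duplicates,
# returning the existing data unchanged when the new result is empty.
merge_tweets <- function(existing, new) {
  if (is.null(new) || nrow(new) == 0) {
    return(existing)
  }
  if (is.null(existing)) {
    return(unique(new))
  }
  unique(rbind(existing, new))
}

old_tweets <- data.frame(id = c(1, 2), text = c("a", "b"), stringsAsFactors = FALSE)
new_tweets <- data.frame(id = c(2, 3), text = c("b", "c"), stringsAsFactors = FALSE)
empty      <- old_tweets[0, ]

merged <- merge_tweets(old_tweets, new_tweets)
nrow(merged)                           # 3: the duplicate row with id 2 is dropped
nrow(merge_tweets(old_tweets, empty))  # 2: an empty result leaves the data untouched
```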