Tags: r, twitter, web-scraping, rtweet

R – Using a loop on a list of Twitter handles to extract tweets and create multiple data frames


I’ve got a df that consists of Twitter handles that I wish to scrape on a regular basis.

df=data.frame(twitter_handles=c("@katyperry","@justinbieber","@Cristiano","@BarackObama"))

My Methodology

I would like to run a for loop that loops over each of the handles in my df and creates multiple dataframes:

1) By using the rtweet library, I would like to gather tweets using the search_tweets function.

2) Then I would like to merge the new tweets to existing tweets for each dataframe, and then use the unique function to remove any duplicate tweets.

3) For each dataframe, I'd like to add a column with the name of the Twitter handle used to obtain the data. For example, for the dataframe of tweets obtained using the handle @BarackObama, I'd like an additional column called Source containing the value @BarackObama.

4) If the API returns 0 tweets, I would like Step 2) to be skipped. Very often, when the API returns 0 tweets, I get an error as the code attempts to merge an empty dataframe with an existing one.

5) Finally, I would like to save the results of each scrape to a separate dataframe object. The name of each dataframe object would be its Twitter handle, in lower case and without the @.

My Desired Output

My desired output would be 4 dataframes: katyperry, justinbieber, cristiano and barackobama.

My Attempt

library(rtweet)
library(ROAuth)

#Accessing Twitter API using my Twitter credentials

key <-"yKxxxxxxxxxxxxxxxxxxxxxxx"
secret <-"78EUxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
setup_twitter_oauth(key,secret)

#Dataframe of Twitter handles    
df=data.frame(twitter_handles=c("@katyperry","@justinbieber","@Cristiano","@BarackObama"))

# Setting up the query
query <- as.character(df$twitter_handles)
query <- unlist(strsplit(query,","))
tweets.dataframe = list()

# Loop through the twitter handles & store the results as individual dataframes
for(i in 1:length(query)){
  result<-search_tweets(query[i],n=10000,include_rts = FALSE)
  #Strip tweets that  contain RTs
  tweets.dataframe <- c(tweets.dataframe,result)
  tweets.dataframe <- unique(tweets.dataframe)
}

However, I have not been able to figure out how to make my for loop skip the concatenation step when the API returns 0 tweets for a given handle.

Also, my for loop does not create 4 dataframes in my environment; instead, it stores the results as a single Large list.

I identified a post that addresses a problem very similar to the one I face, but I find it difficult to adapt to my question.

Your inputs would be greatly appreciated.

Edit: I have added Step 3) in My Methodology, in case you are able to help with that too.


Solution

    tweets.dataframe = list()
    
    # Loop through the twitter handles & store the results as individual dataframes
    for(i in seq_along(query)){  # seq_along() is safe even if query is empty
      result<-search_tweets(query[i],n=10,include_rts = FALSE)
    
      if (nrow(result) > 0) {  # only if result has data
        tweets.dataframe <- c(tweets.dataframe, list(result))
      }
    }
    
    # tweets.dataframe is now a list where each element is a data frame containing
    # the results from an individual query; for example...
    
    tweets.dataframe[[1]]
    
    # to combine them into one data frame
    
    do.call(rbind, tweets.dataframe)
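
The `list(result)` wrapper in the loop above matters. A data frame is itself a list of columns, so `c(a_list, a_data_frame)` appends every column as a separate element, which would explain why the original attempt ended up with one large flat list. A small sketch with made-up data (`fake_result` is a hypothetical stand-in for what `search_tweets` returns, so no API access is needed):

```r
# Hypothetical toy data standing in for one search_tweets() result
fake_result <- data.frame(text = c("a", "b"), user = c("x", "y"),
                          stringsAsFactors = FALSE)

# Without list(): the data frame is flattened into its columns
flat <- c(list(), fake_result)
length(flat)    # one element per column

# With list(): the data frame is kept whole as a single element
nested <- c(list(), list(fake_result))
length(nested)  # one element per data frame
is.data.frame(nested[[1]])
```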
    

    In response to a reply, here is a version that creates one data frame per handle:

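    twitter_handles <- c("@katyperry","@justinbieber","@Cristiano","@BarackObama")
    
    # Loop through the twitter handles & store the results as individual dataframes
    for(handle in twitter_handles) {
      result <- search_tweets(handle, n = 15 , include_rts = FALSE)
      if (nrow(result) == 0) next  # skip handles that returned no tweets
      result$Source <- handle
    
      # strip the leading "@" and lower-case: "@BarackObama" -> "barackobama"
      df_name <- tolower(substring(handle, 2))
    
      if(exists(df_name)) {
        # merge with the tweets collected earlier and drop exact duplicate rows
        assign(df_name, unique(rbind(get(df_name), result)))
      } else {
        assign(df_name, result)
      }
    }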
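
As an alternative to `assign()`/`get()`, the results could be kept in a single named list, which is often easier to loop over or save later. The sketch below is an illustration under stated assumptions, not a drop-in replacement: `fetch_tweets` and `update_tweets` are hypothetical helpers, with `fetch_tweets` standing in for `search_tweets()` so the example runs without API credentials. The tweet ID column is assumed to be called `status_id`; the actual column name depends on the rtweet version, so check the returned data first.

```r
# Hypothetical stub standing in for rtweet::search_tweets(); returns toy data.
fetch_tweets <- function(handle) {
  if (handle == "@Cristiano") {
    # Simulate a query that returns no tweets
    return(data.frame(status_id = character(0), text = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(status_id = c("1", "2"),
             text = paste(handle, c("first", "second")),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: merge fresh results into a named list of data frames.
update_tweets <- function(store, handles, fetch = fetch_tweets) {
  for (handle in handles) {
    result <- fetch(handle)
    if (nrow(result) == 0) next             # step 4: skip empty results

    result$Source <- handle                 # step 3: record the handle
    name <- tolower(sub("^@", "", handle))  # step 5: "@BarackObama" -> "barackobama"

    # store[[name]] is NULL on the first run, and rbind() ignores NULL
    combined <- rbind(store[[name]], result)
    # step 2: drop repeated tweets by ID (assumed column name)
    store[[name]] <- combined[!duplicated(combined$status_id), , drop = FALSE]
  }
  store
}

handles <- c("@katyperry", "@justinbieber", "@Cristiano", "@BarackObama")
tweets <- update_tweets(list(), handles)
tweets <- update_tweets(tweets, handles)  # a repeat scrape adds no duplicate rows
names(tweets)
```

De-duplicating on an ID column rather than with `unique()` on whole rows avoids keeping two copies of a tweet whose other fields (for example, counts) changed between scrapes. If separate objects in the global environment are still wanted, `list2env(tweets, envir = .GlobalEnv)` would create one per list element.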