Tags: r, dplyr, data-wrangling, rtweet

Categorize observations in dataframe by different identifiers


I've searched around for a solution to this problem, but can't seem to find one.

I have pulled tweets from Danish MPs using the rtweet package to access the Twitter API. I used get_timeline() to pull the data.

get_timeline(c(politikere), n = 100,  parse = TRUE, since_id = "1315756184247435264", max_id = "1333904927559725056", type = "recent") %>%
  dplyr::filter(created_at > "2020-10-25" & created_at <="2020-12-01")  

Now I would like to categorize the different Twitter users by their party ID, in order to do some party-specific sentiment analysis. The API call returns a tibble with around 90 variables, starting with "user_id":

user_id status_id created_at screen_name text description ...x_i

I want to create a new column in the dataset named party_id that identifies each user's party affiliation, so that every row for a given user carries the party that user belongs to. It should look something like this:

user_id status_id created_at screen_name text description party_id
1234346 683901040 2020-11-23 larsen_mc gg.. Danish MP.. Conservatives

I looked at the dplyr package, but I can't quite get my head around how to assign the same value to different rows that do not share the same identifier. If, e.g., all the conservative MPs shared the same status_id, it would be a somewhat easier task using inner_join, but every user has its own unique identifier in this case (of course).

Here is the example data:

structure(list(user_id = c("2373406198", "4360080437", "3512158337", 
"746909257", "36910691", "58550919", "279986859", "1225930531", 
"26263965", "2222188479"), status_id = c("1354094283230474241", 
"1354707826317393922", "1354391556900483072", "1347169543853117444", 
"1354866447735005185", "1332633849659088897", "1355522537669734401", 
"1355554489361686530", "1329028442105458688", "1330791375449829376"
), created_at = structure(c(1611676209, 1611822489, 1611747085, 
1610025223, 1611860307, 1606559643, 1612016732, 1612024349, 1605700047, 
1606120363), tzone = "UTC", class = c("POSIXct", "POSIXt")), 
    screen_name = c("jacobmark_sf", "RuneLundEL", "kimvalentinDK", 
    "TommyPetersenDK", "JuulMona", "Blixt22", "JanEJoergensen", 
    "RasmusJarlov", "StemLAURITZEN", "olebirkolesen")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

Hope this makes sense.

Best, Gustav


Solution

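  • Okay - I found a solution! After making the identifier manually (a column called Parti_id in a separate table, poldata, with one row per screen_name), I used the tidyverse package and left_join() to attach it to the tweets:

    library(tidyverse)
    # Keep only the join key and the manually created party column
    poldata <- poldata %>%
      select(screen_name, Parti_id)
    # Add Parti_id to every tweet (here, tmlpol) by matching on screen_name
    FTtweets <- left_join(tmlpol, poldata, by = "screen_name")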
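
    For anyone who wants to try the approach end to end, here is a minimal, self-contained sketch of the same idea. It assumes only dplyr is loaded. The tmlpol tibble below is a small stand-in for the real tweet data, the status_id values are made up, and the party labels ("Party A", "Party B") are placeholders rather than the MPs' real affiliations. The anti_join() step is an addition that is not part of the solution above; it simply lists accounts that still lack a party.

```r
library(dplyr)

# Stand-in for the tweet data: one row per tweet, so a screen_name can repeat.
# The status_id values are hypothetical placeholders.
tmlpol <- tibble(
  screen_name = c("jacobmark_sf", "RuneLundEL", "jacobmark_sf", "Blixt22"),
  status_id   = c("s1", "s2", "s3", "s4")
)

# Hand-made lookup table: one row per account.
# The party labels are placeholders, not real affiliations.
poldata <- tibble(
  screen_name = c("jacobmark_sf", "RuneLundEL"),
  Parti_id    = c("Party A", "Party B")
)

# Each tweet row picks up the party of its author;
# rows whose screen_name is not in poldata get NA.
FTtweets <- left_join(tmlpol, poldata, by = "screen_name")

# Accounts that are still missing from the lookup table.
unmatched <- tmlpol %>%
  anti_join(poldata, by = "screen_name") %>%
  distinct(screen_name)

print(FTtweets)
print(unmatched)
```

    Because the lookup table has one row per screen_name, the join repeats the party value on every tweet by that account without adding rows, which is what makes it possible to "assign the same value to different rows" without a shared status_id.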