rdataframedplyrdata-wranglingsummarize

Reformat data to summarize and collapse rows into simple table


I have a dataset that looks something like this:

name party count year likes retweet
Tom R 1 2016 1357 23
Dave R 1 2016 1881 34
Larry D 1 2016 324 45
Tim D 1 2016 5587 56
Rob R 1 2016 9847 67
Sam D 1 2016 4466 78
Tom R 1 2017 ### ##
Dave R 1 2017 ### ##
Larry D 1 2017 ### ##
Tim D 1 2017 ### ##
Rob R 1 2017 ### ##
Sam D 1 2017 ### ##
Tom R 1 2018 ### ##
Dave R 1 2018 ### ##
Larry D 1 2018 ### ##
Tim D 1 2018 ### ##
Rob R 1 2018 ### ##
Sam D 1 2018 ### ##

I have several columns that I want to collapse and sum, so that I can have a simpler output of a long dataset. I you'll notice that there are a bunch of "###" symbols, but that was just because I didn't think it was necessary to add additional information, as I believe what I provided is enough to solve my problem.

But I want to reformat the data, so it's easier to create a plot. Basically, I want the variables below. So, I want the first column to be the year (one row for each year in the dataset). The second column should be the sum of rows that were collapsed to create that row (I created the "count" variable in Table 1, so that I could sum the row value.) I then want four new columns, each representing the sum total of likes and retweets for Democrats and Republicans for that year. It should look something like this:

year count likes_republicans likes_democrats retweets_republicans retweets_democrats
2016 6 13085 10377 124 179
2017 6 ### ### ### ###
2018 6 ### ### ### ###

I feel like there is a simple command in dplyr to do this type of problem, but I don't remember what it is and I feel like I really struggle to understand it and how it works. Any help would be greatly appreciated.


Solution

  • Sticking to data.table.

    > library(data.table)
    > dt <- melt(dt, c(1, 2, 4), c(3, 5:6)) |> 
    +   dcast(year ~ party + variable, fun=sum)
    > dt[, count := D_count + R_count]
    > dt
    Key: <year>
        year D_count D_likes D_retweet R_count R_likes R_retweet count
       <int>   <int>   <int>     <int>   <int>   <int>     <int> <int>
    1:  2016       3   10377      6859       3   13085      4733     6
    2:  2017       3    3766      5346       3    5944      7332     6
    3:  2018       3    7426      2255       3   56041      6109     6
    

    Data:

    dt <- data.table::fread('
      name  party   count   year    likes   retweet
    Tom R   1   2016    1357    23
    Dave    R   1   2016    1881    34
    Larry   D   1   2016    324 45
    Tim D   1   2016    5587    56
    Rob R   1   2016    9847    4676
    Sam D   1   2016    4466    6758
    Tom R   1   2017    3456    2543
    Dave    R   1   2017    2345    3546
    Larry   D   1   2017    675 1423
    Tim D   1   2017    2345    3567
    Rob R   1   2017    143 1243
    Sam D   1   2017    746 356
    Tom R   1   2018    134 1423
    Dave    R   1   2018    54673   563
    Larry   D   1   2018    5467    1243
    Tim D   1   2018    1423    365
    Rob R   1   2018    1234    4123
    Sam D   1   2018    536 647
    ')