rdplyrreshapetidyrdatabase-management

Row combination to remove NA's based on common ID's


I recently asked a question regarding database managemen in R packages tidyr, dplyr or similar in this link (Reorganizing columns by two column combination).

It was helpful, I managed to do the proposed code, but I was wring in the final format I was requiring.

I cannot manage to reach this format and was wondering how could be easily done.

Given this as an example database:

Factor 1    Factor 2        Year    value1   value2
A            green          2016     1.9      20
A            green          2015     1.9      20
A            green          2015      4       30
B            yellow         2015      3       10
B            yellow         2016      8       11

And trying to obtain:

Factor 1    Factor 2   value1.2015   value1.2016 value2.2015 value2.2016   
A            green          5.9          1.9            50           20
B            yellow         3            8              10           11

So, it would be setting common identifiers for Factor, 1 and Factor 2, to spread the dataset by years and summing up common years for values 1 and 2

I am a begginer with tidyr, dplyr and cannot easily manage to do this.

I have been able to spread the dataset by years doing :

df.spread<-df %>%
  gather(value1,value2,-factor1,-factor2,-Year) %>%
  unite(Year,Year, value1, sep = "") %>%
  spread(Year,value2)

but this does not sum the value1 for the common years as I want it.


Solution

  • We group by 'Factor1', 'Factor2', 'Year', get the sum of all columns (summarise_all), then gather into 'long' format, unite the 'Year', 'key' columns together to create a single column and spread from 'long' to 'wide' format

    library(tidyverse)
    df %>% 
        group_by(Factor1, Factor2, Year) %>%
        summarise_all(sum) %>% 
        gather(key, value, value1:value2) %>% 
        unite(Year, key, Year, sep=".") %>% 
        spread(Year, value)
    # Groups:   Factor1, Factor2 [2]
    #  Factor1 Factor2 value1.2015 value1.2016 value2.2015 value2.2016
    #  <chr>   <chr>         <dbl>       <dbl>       <dbl>       <dbl>
    #1 A       green           5.9         1.9          50          20
    #2 B       yellow          3           8            10          11
    

    This can also be done with dcast from data.table, where we can pass multiple value.var columns and a fun.aggregate argument

    library(data.table)
    dcast(setDT(df), Factor1 + Factor2 ~ Year, value.var = c('value1', 'value2'), sum)
    #    Factor1 Factor2 value1_2015 value1_2016 value2_2015 value2_2016
    #1:       A   green         5.9         1.9          50          20
    #2:       B  yellow         3.0         8.0          10          11
    

    data

    df <- structure(list(Factor1 = c("A", "A", "A", "B", "B"), Factor2 = c("green", 
    "green", "green", "yellow", "yellow"), Year = c(2016L, 2015L, 
    2015L, 2015L, 2016L), value1 = c(1.9, 1.9, 4, 3, 8), value2 = c(20L, 
    20L, 30L, 10L, 11L)), class = "data.frame", row.names = c(NA, 
    -5L))