I recently asked a question regarding database managemen in R packages tidyr, dplyr or similar in this link (Reorganizing columns by two column combination).
It was helpful, I managed to do the proposed code, but I was wring in the final format I was requiring.
I cannot manage to reach this format and was wondering how could be easily done.
Given this as an example database:
Factor 1 Factor 2 Year value1 value2
A green 2016 1.9 20
A green 2015 1.9 20
A green 2015 4 30
B yellow 2015 3 10
B yellow 2016 8 11
And trying to obtain:
Factor 1 Factor 2 value1.2015 value1.2016 value2.2015 value2.2016
A green 5.9 1.9 50 20
B yellow 3 8 10 11
So, it would be setting common identifiers for Factor, 1 and Factor 2, to spread the dataset by years and summing up common years for values 1 and 2
I am a begginer with tidyr, dplyr and cannot easily manage to do this.
I have been able to spread the dataset by years doing :
df.spread<-df %>%
gather(value1,value2,-factor1,-factor2,-Year) %>%
unite(Year,Year, value1, sep = "") %>%
spread(Year,value2)
but this does not sum the value1 for the common years as I want it.
We group by 'Factor1', 'Factor2', 'Year', get the sum
of all columns (summarise_all
), then gather
into 'long' format, unite
the 'Year', 'key' columns together to create a single column and spread
from 'long' to 'wide' format
library(tidyverse)
df %>%
group_by(Factor1, Factor2, Year) %>%
summarise_all(sum) %>%
gather(key, value, value1:value2) %>%
unite(Year, key, Year, sep=".") %>%
spread(Year, value)
# Groups: Factor1, Factor2 [2]
# Factor1 Factor2 value1.2015 value1.2016 value2.2015 value2.2016
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 A green 5.9 1.9 50 20
#2 B yellow 3 8 10 11
This can also be done with dcast
from data.table
, where we can pass multiple value.var
columns and a fun.aggregate
argument
library(data.table)
dcast(setDT(df), Factor1 + Factor2 ~ Year, value.var = c('value1', 'value2'), sum)
# Factor1 Factor2 value1_2015 value1_2016 value2_2015 value2_2016
#1: A green 5.9 1.9 50 20
#2: B yellow 3.0 8.0 10 11
df <- structure(list(Factor1 = c("A", "A", "A", "B", "B"), Factor2 = c("green",
"green", "green", "yellow", "yellow"), Year = c(2016L, 2015L,
2015L, 2015L, 2016L), value1 = c(1.9, 1.9, 4, 3, 8), value2 = c(20L,
20L, 30L, 10L, 11L)), class = "data.frame", row.names = c(NA,
-5L))