I have dataset that contains:
ID1
and ID2
) that I want to get rid ofList
that can contain a specification (type: string
) that differs between the duplicates,V1
and V2
that contain values (type: int
) that I want to sum
between the duplicates.A simple example of my dataset is this:
df <- data.frame(
ID1 = c('A','A','B','C','D','D'),
ID2 = c('A','A','B1','C1','DE','DE'),
List = c('LA','LA','LB','LC','L1D','L2D'),
V1 = c(1, 0, 1, 1, 1, 1),
V2 = c(0, 1, 0, 0, 1, 1),
stringsAsFactors = FALSE
)
> df
ID1 ID2 List V1 V2
1 A A LA 1 0
2 A A LA 0 1
3 B B1 LB 1 0
4 C C1 LC 1 0
5 D DE L1D 1 1
6 D DE L2D 1 1
And I want to end up with this:
> df_final
ID1 ID2 List_1 List_2 V1 V2
1 A A LA LA 1 1
3 B B1 LB 1 0
4 C C1 LC 1 0
5 D DE L1D L2D 2 2
So my idea was to first dcast the List
column:
df <- df %>%
group_by(ID1, ID2) %>%
mutate(Index = row_number(List))
and then to dcast with something like this:
df <- dcast(df, ID1+ID2 +V1 + V2 ~ Index, value.var = "List", fun.aggregate =sum)
But it can't handle the sum
when also factors are involved (which makes sense). How can I best do this to get to df_final
?
A base R option with aggregate
+ merge
merge(
aggregate(List ~ ID1 + ID2, df, `length<-`, 2),
aggregate(cbind(V1, V2) ~ ID1 + ID2, df, sum)
)
which gives
ID1 ID2 List.1 List.2 V1 V2
1 A A LA LA 1 1
2 B B1 LB <NA> 1 0
3 C C1 LC <NA> 1 0
4 D DE L1D L2D 2 2