[SOLVED] Remove duplicates based on 2 rows & specify what to do per column

Remove duplicates based on 2 rows & specify what to do per column

I have dataset that contains:

Duplicate IDs (column ID1 and ID2) that I want to get rid of
Column List that can contain a specification (type: string) that differs between the duplicates,
And 2 columns V1 and V2 that contain values (type: int) that I want to sum between the duplicates.

A simple example of my dataset is this:

df <- data.frame(
  ID1 = c('A','A','B','C','D','D'), 
  ID2 = c('A','A','B1','C1','DE','DE'),
  List = c('LA','LA','LB','LC','L1D','L2D'),
  V1 = c(1, 0, 1, 1, 1, 1),
  V2 = c(0, 1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

> df
  ID1 ID2 List V1 V2
1   A   A    LA  1  0
2   A   A    LA  0  1
3   B  B1    LB  1  0
4   C  C1    LC  1  0
5   D  DE   L1D  1  1
6   D  DE   L2D  1  1

And I want to end up with this:

> df_final
  ID1 ID2  List_1  List_2 V1 V2
1   A   A      LA      LA  1  1
3   B  B1      LB          1  0
4   C  C1      LC          1  0
5   D  DE     L1D     L2D  2  2

So my idea was to first dcast the List column:

df <- df %>%
  group_by(ID1, ID2) %>%
  mutate(Index = row_number(List))

and then to dcast with something like this:

df <- dcast(df, ID1+ID2 +V1 + V2 ~ Index, value.var = "List", fun.aggregate =sum)

But it can't handle the sumwhen also factors are involved (which makes sense). How can I best do this to get to df_final?

Solution

A base R option with aggregate + merge

merge(
    aggregate(List ~ ID1 + ID2, df, `length<-`, 2),
    aggregate(cbind(V1, V2) ~ ID1 + ID2, df, sum)
)

which gives

  ID1 ID2 List.1 List.2 V1 V2
1   A   A     LA     LA  1  1
2   B  B1     LB   <NA>  1  0
3   C  C1     LC   <NA>  1  0
4   D  DE    L1D    L2D  2  2