rdplyrduplicatesdcast

Remove duplicates based on 2 rows & specify what to do per column


I have dataset that contains:

A simple example of my dataset is this:

df <- data.frame(
  ID1 = c('A','A','B','C','D','D'), 
  ID2 = c('A','A','B1','C1','DE','DE'),
  List = c('LA','LA','LB','LC','L1D','L2D'),
  V1 = c(1, 0, 1, 1, 1, 1),
  V2 = c(0, 1, 0, 0, 1, 1),
  stringsAsFactors = FALSE
)

> df
  ID1 ID2 List V1 V2
1   A   A    LA  1  0
2   A   A    LA  0  1
3   B  B1    LB  1  0
4   C  C1    LC  1  0
5   D  DE   L1D  1  1
6   D  DE   L2D  1  1

And I want to end up with this:

> df_final
  ID1 ID2  List_1  List_2 V1 V2
1   A   A      LA      LA  1  1
3   B  B1      LB          1  0
4   C  C1      LC          1  0
5   D  DE     L1D     L2D  2  2

So my idea was to first dcast the List column:

df <- df %>%
  group_by(ID1, ID2) %>%
  mutate(Index = row_number(List))

and then to dcast with something like this:

df <- dcast(df, ID1+ID2 +V1 + V2 ~ Index, value.var = "List", fun.aggregate =sum)

But it can't handle the sumwhen also factors are involved (which makes sense). How can I best do this to get to df_final?


Solution

  • A base R option with aggregate + merge

    merge(
        aggregate(List ~ ID1 + ID2, df, `length<-`, 2),
        aggregate(cbind(V1, V2) ~ ID1 + ID2, df, sum)
    )
    

    which gives

      ID1 ID2 List.1 List.2 V1 V2
    1   A   A     LA     LA  1  1
    2   B  B1     LB   <NA>  1  0
    3   C  C1     LC   <NA>  1  0
    4   D  DE    L1D    L2D  2  2