r dplyr set-intersection set-operations

How to intersect the elements of one column within each group (group_by) in R


Let's assume my data looks like this:

  group_id    col1
1        1     A,B
2        1     B,C
3        2     A,C
4        2     B,D
5        3     A,D
6        3 A,B,C,D

I would like to summarise or mutate col1 so that its comma-separated elements are intersected within each group (by group_id). My required output when summarising is:

  group_id col1
1        1    B
2        2 <NA>
3        3  A,D

or like this (if mutating)

  group_id col1
1        1    B
2        1    B
3        2 <NA>
4        2 <NA>
5        3  A,D
6        3  A,D

I can easily create a union with toString, but I can't work out how to keep only the common elements in the output. The problem is that intersect needs two arguments, so it does not work directly on a whole group here.

The dput(df) output is:

df <-  structure(list(group_id = c(1L, 1L, 2L, 2L, 3L, 3L), col1 = c("A,B", 
"B,C", "A,C", "B,D", "A,D", "A,B,C,D")), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

Solution

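  • You can split col1 on the comma and use Reduce + intersect to get the common values in each group_id. strsplit returns a list with one character vector per row, and Reduce applies intersect pairwise across that list. Note that toString joins with ", " and returns an empty string (not NA) when a group has nothing in common.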

    library(dplyr)
    df %>%
      group_by(group_id) %>%
      summarise(col1 = toString(Reduce(intersect, strsplit(col1, ','))))
    
    #  group_id col1  
    #*    <int> <chr> 
    #1        1 "B"   
    #2        2 ""    
    #3        3 "A, D"
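
If the output has to match the question exactly (NA for an empty intersection, and "A,D" without a space), one possible variation is sketched below. It assumes the same df as in the question; common_elems is a hypothetical helper name introduced only for this illustration and is not part of dplyr. The same helper is used for both the summarise and the mutate form.

```r
library(dplyr)

# Same data as in the question
df <- data.frame(
  group_id = c(1L, 1L, 2L, 2L, 3L, 3L),
  col1 = c("A,B", "B,C", "A,C", "B,D", "A,D", "A,B,C,D")
)

# Hypothetical helper (name is an assumption, not a dplyr function):
# intersect the comma-separated sets in x and return one string,
# or NA when the sets share no element.
common_elems <- function(x) {
  common <- Reduce(intersect, strsplit(x, ",", fixed = TRUE))
  if (length(common) == 0) NA_character_ else paste(common, collapse = ",")
}

# One row per group
res_summarise <- df %>%
  group_by(group_id) %>%
  summarise(col1 = common_elems(col1))

# Original number of rows; the single value is recycled within each group
res_mutate <- df %>%
  group_by(group_id) %>%
  mutate(col1 = common_elems(col1)) %>%
  ungroup()
```

Using paste(collapse = ",") instead of toString keeps the separator identical to the input, and the explicit length check turns an empty intersection into NA_character_ rather than "".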