rdataframeaggregate

How to group by and concatenate strings but skip one group?


I have a table with people and the groups they belong to. It is formatted like so:

person_id <- c("A1", "A1", "A1", "A1", "A2", "A2", "A3", "A3", "B1", "B1", "C1", "C1", "C2", "C2")
year <- c(2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016, 2015, 2016)
group_id <- c("abc", "abc", "cdz", "cdz", "abc", "abc", "ghe", "ghe", "abc", "fjx", "ghe", "ghe", "cdz", "cdz")
example <- data.frame(person_id, group_id, year)

I want to create a column that for each person shows which other people they share a group with in a particular year. This is what I have currently:

example <- within(example, {
  connections <- ave(person_id, group_id, year, 
                     FUN=function(x) paste(x, collapse=', ')
                     )})
res <- example %>%
  group_by(person_id, year) %>%
  summarise(joint=paste(connections, collapse=', '))

This is close to what I need, but I do not want to include the person_id in each row. For example, my code creates a new column with the first value of "A1,A2,B1,A1,C2". I would like have this value be "A2,B1,C2". In my example person B1 does not share a group with anyone in 2016. My code produces the row value of "B1", but I would like this cell to be an empty string. How can I achieve this?

Also, the data I am working with is quite large, approximately 1 billion rows. It seems rather inefficient to group by twice, but I am not sure what I want to do is possible without doing so. Is there a better way to approach this?

Note: I cannot use tidyr. Also, my R version is 3.4.3.


Solution

  • code without tidyr:

    library(dplyr)
    library(tibble)
    
    example %>%
      mutate(p_y = paste(person_id, year, sep = '_'), person_id =NULL, year = NULL) %>%
      igraph::graph_from_data_frame() %>%
      igraph::components() %>%
      igraph::membership() %>%
      enframe() %>%
      filter(grepl('_', name)) %>%
      mutate(person_id = sub("_.*", "", name), year = sub(".*_", "", name))%>%
      group_by(year, value) %>%
      mutate(value = list(person_id)) %>%
      group_by(person_id, year) %>%
      summarise(value = toString(setdiff(value[[1]], person_id)), .groups = 'drop')
    
    
       # A tibble: 12 × 3
       person_id year  value       
       <chr>     <chr> <chr>       
     1 A1        2015  "A2, B1, C2"
     2 A1        2016  "A2, C2"    
     3 A2        2015  "A1, B1, C2"
     4 A2        2016  "A1, C2"    
     5 A3        2015  "C1"        
     6 A3        2016  "C1"        
     7 B1        2015  "A1, A2, C2"
     8 B1        2016  ""          
     9 C1        2015  "A3"        
    10 C1        2016  "A3"        
    11 C2        2015  "A1, A2, B1"
    12 C2        2016  "A1, A2" 
    

    Edit:

    In the case where no transitivity:

    example %>%
      group_by(year, group_id) %>%
      mutate(v = list(person_id)) %>%
      group_by(person_id, year) %>%
      summarise(v = toString(setdiff(unlist(v), person_id)))
      
    # A tibble: 12 × 3
    # Groups:   person_id [6]
       person_id  year v           
       <chr>     <dbl> <chr>       
     1 A1         2015 "A2, B1, C2"
     2 A1         2016 "A2, C2"    
     3 A2         2015 "A1, B1"    
     4 A2         2016 "A1"        
     5 A3         2015 "C1"        
     6 A3         2016 "C1"        
     7 B1         2015 "A1, A2"    
     8 B1         2016 ""          
     9 C1         2015 "A3"        
    10 C1         2016 "A3"        
    11 C2         2015 "A1"        
    12 C2         2016 "A1"