rdplyrsummarization

Get unique values from rows with comma separated values based on a specific category column in R


Lets say I have:

group     X                 Y                   Z
A         cat, dog          dog, fox        
A         fox, chicken      dog, fox, chicken
A
B         fox, dog 
B         fox
B                                               bunny

I want to summarize unique animals per group based on these three columns, as

group     animal
A         cat, dog, fox, chicken
B         bunny, dog, fox

The order of animals doens't matter, as long as they are the unique values. I tried:

df %>%  
group_by(group) %>% 
separate_rows("X", sep=",")   %>%  
distinct %>%    
summarise(X = toString(X)) 

                                       

Which does not work even for one column (I get a bunch of NAs)

I was thinking something such as

df %>%   group_by(group) %>%   summarise_at(vars("X","Y","Z"), sum)

Which works for numerical variables not separated by comma (one value per row)


Solution

  • One way would be:

    df %>%
       pivot_longer(-group)%>%
       separate_rows(value)%>%
       summarise(animal = toString(unique(value[nzchar(value)])), .by = group)
    
    # A tibble: 2 × 2
      group animal                
      <chr> <chr>                 
    1 A     cat, dog, fox, chicken
    2 B     fox, dog, bunny  
    

    In base R you could do:

    fn <- function(x) {
      toString(unique(scan(text=x, what="", sep=',', strip.white = TRUE, quiet = TRUE)))
    }
    aggregate(values~group, cbind(df[1], stack(df,-1)), fn)
    
      group                 values
    1     A cat, dog, fox, chicken
    2     B        fox, dog, bunny