r cluster-analysis igraph tidygraph

How to identify groups of connected values across two columns, each of which has repeated values with multiple matches


Take the following data. I want to add a column indicating which group of connected values each row is part of.

library(tidyverse)
df <- structure(list(
  fruit = c("apple", "apple", "apple", "pear", "pear",
            "banana", "banana", "peach", "cherry"),
  name = c("joe", "sally", "steve", "pete", "kate",
           "george", "alex", "alex", "alex")
), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L))
df
# A tibble: 9 × 2
  fruit  name  
  <chr>  <chr> 
1 apple  joe   
2 apple  sally 
3 apple  steve 
4 pear   pete  
5 pear   kate  
6 banana george
7 banana alex  
8 peach  alex  
9 cherry alex  

Here is the kind of output I'm looking for. Groups 1 and 2 are straightforward: the rows in each are joined by a common fruit value.

Group 3 is more complicated. George is connected to banana. Banana is also connected to Alex, who is in turn connected to peach and cherry. So group 3 contains George, Alex, banana, peach, and cherry.

# A tibble: 9 × 3
  fruit  name   group 
  <chr>  <chr>  <chr> 
1 apple  joe    group1
2 apple  sally  group1
3 apple  steve  group1
4 pear   pete   group2
5 pear   kate   group2
6 banana george group3
7 banana alex   group3
8 peach  alex   group3
9 cherry alex   group3

Essentially, the group field needs to contain a common ID for all the values that would be connected in a network graph, like so:

library(ggraph)  # ggraph() and the geom_edge_*/geom_node_* layers are not part of tidyverse

tidygraph::as_tbl_graph(df) %>%
  ggraph(layout = "tree") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_label(aes(label = name))

[Figure: network graph showing the connections in df]


Solution

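  • You could try components() from igraph, which returns the connected component that each vertex belongs to. Indexing that membership by fruit then gives the group for each row: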

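    library(igraph)
    df %>%
        mutate(group = paste0("group", {
            # build a graph with one edge per row (fruit -> name)
            graph_from_data_frame(.) %>%
                # find the connected components (weak by default, so edge direction is ignored)
                components() %>%
                # named vector mapping vertex name -> component ID
                membership() %>%
                # look up each row's component via its fruit
                `[`(fruit)
        }))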
    

    which gives

    # A tibble: 9 × 3
      fruit  name   group
      <chr>  <chr>  <chr>
    1 apple  joe    group1
    2 apple  sally  group1
    3 apple  steve  group1
    4 pear   pete   group2
    5 pear   kate   group2
    6 banana george group3
    7 banana alex   group3
    8 peach  alex   group3
    9 cherry alex   group3
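
For readers who find the nested pipe hard to follow, the same idea can be written step by step. This is only a sketch of the approach above, not a different method: it rebuilds the question's data as a plain data frame, and the intermediate names `g` and `comp_id` are made up for illustration.

```r
library(igraph)

# Same data as in the question, rebuilt as a plain data frame
df <- data.frame(
  fruit = c("apple", "apple", "apple", "pear", "pear",
            "banana", "banana", "peach", "cherry"),
  name = c("joe", "sally", "steve", "pete", "kate",
           "george", "alex", "alex", "alex"),
  stringsAsFactors = FALSE
)

# Each row becomes one undirected edge between a fruit and a name
g <- graph_from_data_frame(df, directed = FALSE)

# Named vector mapping vertex name -> connected-component ID
comp_id <- components(g)$membership

# Both endpoints of an edge are in the same component,
# so looking up the fruit is enough to label the row
df$group <- paste0("group", comp_id[df$fruit])
df
```

This should reproduce the grouping shown above. One thing to keep in mind: graph_from_data_frame() identifies vertices by name, so if the same string ever appeared both as a fruit and as a name, the two would be treated as a single vertex and their groups would merge. If that is a possibility in your data, prefixing the values of one column before building the graph would likely avoid it.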