rdataframefiltercluster-analysis

Filter rows based on combined set of values in a string


In R, I have the following dataframe with the column "overlap" listing rows that have overlapping values on some other column.

df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                              "5,6", "5,6,7", "6,7", 
                              "8,9", "8,9,10", "9,10", 
                              "11,12,13", "11,12,13", 
                              "11,12,13,14", "13,14", 
                              "15,16", "15,16,17", "16,17", 
                              "18,19", "18,19,20", "19,20"))

df
         overlap
  1        1,2,3
  2        1,2,3
  3      1,2,3,4
  4          3,4
  5          5,6
  6        5,6,7
  7          6,7
  8          8,9
  9       8,9,10
  10        9,10
  11    11,12,13
  12    11,12,13
  13 11,12,13,14
  14       13,14
  15       15,16
  16    15,16,17
  17       16,17
  18       18,19
  19    18,19,20
  20       19,20

I would like to identify rows with common values, even if those values are not in all rows, and then keep only 1 of the rows. For example, rows 1-4 contain the combined set 1,2,3,4 and I would like to keep only one of these rows. If we keep the first row, the resulting df would be:

  1        1,2,3
  5          5,6
  8          8,9
  11    11,12,13
  15       15,16
  18       18,19

I've searched many other solutions on here and none include uneven rows lengths, and which is vital as the full data can have rows with dozens of values.


Solution

  • One option for this particular example data is to create an igraph graph from row overlaps, detect connected components in resulting graph and use component's cluster id as a grouping variable. From there we can pick the first row from every group.

    library(dplyr)
    library(igraph)
    
    df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4", 
                                  "5,6", "5,6,7", "6,7", 
                                  "8,9", "8,9,10", "9,10", 
                                  "11,12,13", "11,12,13", 
                                  "11,12,13,14", "13,14", 
                                  "15,16", "15,16,17", "16,17", 
                                  "18,19", "18,19,20", "19,20"))
    
    
    df |> 
      mutate(id = row_number(), .before = 1) |> 
      group_by(
        g_clust = 
          strsplit(overlap, ",") |> 
          # either create a directed graph or set duplicate = FALSE for 
          # corner cases like `overlap = c("1", "1,2,3", ...)`
          graph_from_adj_list(mode = "all", duplicate = FALSE) |> 
          components() |> 
          getElement("membership")
        ) |> 
      slice_head(n = 1)
    #> # A tibble: 6 × 3
    #> # Groups:   g_clust [6]
    #>      id overlap  g_clust
    #>   <int> <chr>      <dbl>
    #> 1     1 1,2,3          1
    #> 2     5 5,6            2
    #> 3     8 8,9            3
    #> 4    11 11,12,13       4
    #> 5    15 15,16          5
    #> 6    18 18,19          6
    

    Overlaps graph for reference:

    strsplit(df$overlap, ",") |> 
      graph_from_adj_list(mode = "all", duplicate = FALSE) |>
      plot()
    

    graph