In R, I have the following dataframe with the column "overlap" listing rows that have overlapping values on some other column.
df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4",
"5,6", "5,6,7", "6,7",
"8,9", "8,9,10", "9,10",
"11,12,13", "11,12,13",
"11,12,13,14", "13,14",
"15,16", "15,16,17", "16,17",
"18,19", "18,19,20", "19,20"))
df
overlap
1 1,2,3
2 1,2,3
3 1,2,3,4
4 3,4
5 5,6
6 5,6,7
7 6,7
8 8,9
9 8,9,10
10 9,10
11 11,12,13
12 11,12,13
13 11,12,13,14
14 13,14
15 15,16
16 15,16,17
17 16,17
18 18,19
19 18,19,20
20 19,20
I would like to identify rows with common values, even if those values are not in all rows, and then keep only 1 of the rows. For example, rows 1-4 contain the combined set 1,2,3,4 and I would like to keep only one of these rows. If we keep the first row, the resulting df would be:
1 1,2,3
5 5,6
8 8,9
11 11,12,13
15 15,16
18 18,19
I've searched many other solutions on here and none include uneven rows lengths, and which is vital as the full data can have rows with dozens of values.
One option for this particular example data is to create an igraph
graph from row overlaps, detect connected components in resulting graph and use component's cluster id as a grouping variable. From there we can pick the first row from every group.
library(dplyr)
library(igraph)
df <- data.frame(overlap = c("1,2,3", "1,2,3", "1,2,3,4", "3,4",
"5,6", "5,6,7", "6,7",
"8,9", "8,9,10", "9,10",
"11,12,13", "11,12,13",
"11,12,13,14", "13,14",
"15,16", "15,16,17", "16,17",
"18,19", "18,19,20", "19,20"))
df |>
mutate(id = row_number(), .before = 1) |>
group_by(
g_clust =
strsplit(overlap, ",") |>
# either create a directed graph or set duplicate = FALSE for
# corner cases like `overlap = c("1", "1,2,3", ...)`
graph_from_adj_list(mode = "all", duplicate = FALSE) |>
components() |>
getElement("membership")
) |>
slice_head(n = 1)
#> # A tibble: 6 × 3
#> # Groups: g_clust [6]
#> id overlap g_clust
#> <int> <chr> <dbl>
#> 1 1 1,2,3 1
#> 2 5 5,6 2
#> 3 8 8,9 3
#> 4 11 11,12,13 4
#> 5 15 15,16 5
#> 6 18 18,19 6
Overlaps graph for reference:
strsplit(df$overlap, ",") |>
graph_from_adj_list(mode = "all", duplicate = FALSE) |>
plot()