r dplyr set-intersection

How to find intersection between all possible pairs of sets in a 2-column table?


I want to calculate an overlap coefficient between sets. My data comes as a 2-column table, such as:

df_example <- 
  tibble::tribble(~my_group, ~cities,
                   "foo",   "london",
                   "foo",   "paris", 
                   "foo",   "rome", 
                   "foo",   "tokyo",
                   "foo",   "oslo",
                   "bar",   "paris", 
                   "bar",   "nyc",
                   "bar",   "rome", 
                   "bar",   "munich",
                   "bar",   "warsaw",
                   "bar",   "sf", 
                   "baz",   "milano",
                   "baz",   "oslo",
                   "baz",   "sf",  
                   "baz",   "paris")

In df_example, I have 3 sets (i.e., foo, bar, baz), and members of each set are given in cities.

I would like to end up with a table that intersects every possible pair of sets and records the size of the smaller set in each pair. From those two numbers I can calculate an overlap coefficient for each pair of sets.

(Overlap coefficient = number of common members / size of smaller set)
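
As a quick check of the formula, here is a minimal sketch for a single pair, with the members of foo and bar typed out by hand from df_example. The variable names (foo, bar, common, smaller) are only illustrative.

```r
# Members of two of the sets, copied by hand from df_example
foo <- c("london", "paris", "rome", "tokyo", "oslo")
bar <- c("paris", "nyc", "rome", "munich", "warsaw", "sf")

# base::intersect() takes two vectors and returns their common members
common <- intersect(foo, bar)              # "paris" "rome"

# Size of the smaller of the two sets
smaller <- min(length(foo), length(bar))   # 5

# Overlap coefficient = number of common members / size of smaller set
overlap_coeff <- length(common) / smaller
overlap_coeff                              # 0.4
```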

Desired Output

## # A tibble: 3 × 4
##   combination n_intersected_members size_of_smaller_set overlap_coeff
##   <chr>                       <dbl>               <dbl>         <dbl>
## 1 foo*bar                         2                   5           0.4
## 2 foo*baz                         2                   4           0.5
## 3 bar*baz                         2                   4           0.5

Is there a simple enough way to get this done with dplyr verbs? I've tried

df_example |> 
  group_by(my_group) |> 
  summarise(intersected = dplyr::intersect(cities))

But this won't work, obviously, because dplyr::intersect() expects two vectors. Is there a way to get to the desired output similar to my dplyr direction?


Solution

  • Here is a base R option using combn

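    do.call(
        rbind,                  # stack the one-row data frames, one per pair
        combn(
            # split cities into a named list with one vector per group
            with(
                df_example,
                split(cities, my_group)
            ),
            2,                  # visit every pair of groups
            \(x)
            transform(
                data.frame(
                    combo = paste0(names(x), collapse = "-"),
                    # members of the first set that also occur in the second
                    # (assumes no city is repeated within a group)
                    nrIntersect = sum(x[[1]] %in% x[[2]]),
                    szSmallSet = min(lengths(x))
                ),
                olCoeff = nrIntersect / szSmallSet
            ),
            simplify = FALSE    # keep a list so rbind can combine it
        )
    )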
    

    which gives

        combo nrIntersect szSmallSet olCoeff
    1 bar-baz           2          4     0.5
    2 bar-foo           2          5     0.4
    3 baz-foo           2          4     0.5
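
Since the question asks for dplyr verbs, one possible sketch (an alternative to the base R option, not taken from it) is to self-join the table on cities so that every shared city yields one row per pair of groups, then count the rows per pair. It assumes dplyr 1.1.0 or later for the `relationship` argument, and that no city is repeated within a group. The names with_sizes, set_size and result are illustrative. Pairs with no common member produce no rows in the join, so they would be missing from the result rather than shown with a count of 0.

```r
library(dplyr)

df_example <- tibble::tribble(
  ~my_group, ~cities,
  "foo", "london", "foo", "paris", "foo", "rome", "foo", "tokyo", "foo", "oslo",
  "bar", "paris", "bar", "nyc", "bar", "rome", "bar", "munich",
  "bar", "warsaw", "bar", "sf",
  "baz", "milano", "baz", "oslo", "baz", "sf", "baz", "paris"
)

# Attach each group's size to every one of its rows
with_sizes <- df_example |>
  add_count(my_group, name = "set_size")

result <- with_sizes |>
  # one row per (group a, group b, shared city)
  inner_join(with_sizes, by = "cities", suffix = c("_a", "_b"),
             relationship = "many-to-many") |>
  # keep each unordered pair once and drop self-pairs
  filter(my_group_a < my_group_b) |>
  group_by(my_group_a, my_group_b) |>
  summarise(
    n_intersected_members = n(),
    size_of_smaller_set = min(set_size_a, set_size_b),
    .groups = "drop"
  ) |>
  mutate(
    combination = paste(my_group_a, my_group_b, sep = "*"),
    overlap_coeff = n_intersected_members / size_of_smaller_set
  ) |>
  select(combination, n_intersected_members, size_of_smaller_set, overlap_coeff)

result
```

For the example data this should give the same three rows as the base R option: bar*baz (2 of 4), bar*foo (2 of 5) and baz*foo (2 of 4).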