I want to calculate an overlap coefficient between sets. My data comes as a 2-column table, such as:
df_example <-
tibble::tribble(~my_group, ~cities,
"foo", "london",
"foo", "paris",
"foo", "rome",
"foo", "tokyo",
"foo", "oslo",
"bar", "paris",
"bar", "nyc",
"bar", "rome",
"bar", "munich",
"bar", "warsaw",
"bar", "sf",
"baz", "milano",
"baz", "oslo",
"baz", "sf",
"baz", "paris")
In df_example, I have 3 sets (i.e., foo, bar, baz), and members of each set are given in cities.
I would like to end up with a table that intersects every possible pair of sets and reports the size of the smaller set in each pair, so that I can calculate an overlap coefficient for each pair.
(Overlap coefficient = number of common members / size of smaller set)
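To make the formula concrete, here is a minimal sketch for a single pair of sets, using the foo and bar members from the table above. The helper name overlap_coeff is hypothetical and only for illustration; it assumes duplicates within a set should be ignored.

```r
# Hypothetical helper (the name is an assumption, not from any package):
# overlap coefficient = number of common members / size of the smaller set
overlap_coeff <- function(a, b) {
  a <- unique(a)  # treat inputs as sets, so drop duplicates
  b <- unique(b)
  length(intersect(a, b)) / min(length(a), length(b))
}

foo <- c("london", "paris", "rome", "tokyo", "oslo")
bar <- c("paris", "nyc", "rome", "munich", "warsaw", "sf")

# common members are "paris" and "rome"; the smaller set (foo) has 5 members
overlap_coeff(foo, bar)  # 2 / 5 = 0.4
```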
Desired Output
## # A tibble: 3 × 4
## combination n_intersected_members size_of_smaller_set overlap_coeff
## <chr> <dbl> <dbl> <dbl>
## 1 foo*bar 2 5 0.4
## 2 foo*baz 2 4 0.5
## 3 bar*baz 2 4 0.5
Is there a simple enough way to get this done with dplyr verbs? I've tried
df_example |>
group_by(my_group) |>
summarise(intersected = dplyr::intersect(cities))
But this won't work, obviously, because dplyr::intersect() expects two vectors. Is there a way to get to the desired output similar to my dplyr direction?
Here is a base R option using combn:
do.call(
  rbind,
  combn(
    # one character vector of cities per group
    with(
      df_example,
      split(cities, my_group)
    ),
    2,
    # x is a named list holding the two sets of the current pair
    \(x)
    transform(
      data.frame(
        combo = paste0(names(x), collapse = "-"),
        nrIntersect = sum(x[[1]] %in% x[[2]]),
        szSmallSet = min(lengths(x))
      ),
      olCoeff = nrIntersect / szSmallSet
    ),
    simplify = FALSE
  )
)
which gives
combo nrIntersect szSmallSet olCoeff
1 bar-baz 2 4 0.5
2 bar-foo 2 5 0.4
3 baz-foo 2 4 0.5
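If you would rather stay closer to dplyr verbs, the same idea can be sketched with rowwise() and mutate(). This is only one possible approach, not necessarily the most idiomatic one: it still uses base split() and combn() to build the sets and the pairs, and the intermediate names sets, pair_names, pairs and res are assumptions made for illustration. The output column names follow the desired output in the question.

```r
library(dplyr)

df_example <- tibble::tribble(
  ~my_group, ~cities,
  "foo", "london", "foo", "paris", "foo", "rome", "foo", "tokyo", "foo", "oslo",
  "bar", "paris", "bar", "nyc", "bar", "rome", "bar", "munich",
  "bar", "warsaw", "bar", "sf",
  "baz", "milano", "baz", "oslo", "baz", "sf", "baz", "paris"
)

# named list: one character vector of cities per group
sets <- split(df_example$cities, df_example$my_group)

# every unordered pair of group names, one pair per column
pair_names <- combn(names(sets), 2)
pairs <- tibble(a = pair_names[1, ], b = pair_names[2, ])

res <- pairs |>
  rowwise() |>  # evaluate the expressions below one pair at a time
  mutate(
    combination = paste(a, b, sep = "*"),
    n_intersected_members = length(intersect(sets[[a]], sets[[b]])),
    size_of_smaller_set = min(length(sets[[a]]), length(sets[[b]])),
    overlap_coeff = n_intersected_members / size_of_smaller_set
  ) |>
  ungroup() |>
  select(combination, n_intersected_members, size_of_smaller_set, overlap_coeff)

res
```

Because the pairs are generated up front, pairs with no common members are kept with a count of 0, which a self-join on cities would silently drop.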