In R, I have a data-frame with unique name-value combinations:
data <- data.frame(
name = c('a', 'a', 'b', rep('c', 3)),
value = c('d1', 'd12', 'd123', 'b1', 'c12', 'd1234')
)
> data
name value
1 a d1
2 a d12
3 b d123
4 c b1
5 c c12
6 c d1234
I need a data.frame, where each unique name-value combination is compared to name-value combinations with other names, i.e., all combinations with name 'a' should be compared to all combinations with name 'b' or 'c', but 'a-d1' should not be compared to 'a-d12'. "Compared to" means that they are on the same row:
data2 <- data.frame(
name1 = c(rep('a', 8), rep('b', 3)),
name2 = c(rep(c('b', 'c', 'c', 'c'), 2), rep('c', 3)),
value1 = c(rep('d1', 4), rep('d12', 4), rep('d123', 3)),
value2 = c(rep(c('d123', 'b1', 'c12', 'd1234'), 2), 'b1', 'c12', 'd1234')
)
> data2
name1 name2 value1 value2
1 a b d1 d123
2 a c d1 b1
3 a c d1 c12
4 a c d1 d1234
5 a b d12 d123
6 a c d12 b1
7 a c d12 c12
8 a c d12 d1234
9 b c d123 b1
10 b c d123 c12
11 b c d123 d1234
My real data has 19.000 unique names and 40.000 unique values which form 40.000 unique name-value combinations (each value has only 1 name), so speed is key. Percentiles of number of values per name looks like this (i.e., 99% names have 1-8 values)
25% 50% 75% 90% 99%
1 2 3 4 8
1) Perform a complex self join as shown. Add the dbname=tempfile()
sqldf
argument if you need to perform it out of memory.
library(sqldf)
sqldf("select a.name name1, b.name name2, a.value value1, b.value value2
from data a join data b
on a.name < b.name")
giving
name1 name2 value1 value2
1 a b d1 d123
2 a c d1 b1
3 a c d1 c12
4 a c d1 d1234
5 a b d12 d123
6 a c d12 b1
7 a c d12 c12
8 a c d12 d1234
9 b c d123 b1
10 b c d123 c12
11 b c d123 d1234
2) dplyr or with dplyr
library(dplyr)
data %>%
inner_join(data, join_by(name < name))
3) Base R Use combn
with only base R:
combn(nrow(data), 2, function(ix) {
i <- ix[1]
j <- ix[2]
c(data$name[i], data$name[j], data$value[i], data$value[j])
}) |>
t() |>
as.data.frame() |>
setNames(c("name1", "name2", "value1", "value2")) |>
subset(name1 != name2)