rdplyrtidyrdata-manipulation

Compare each unique name-value combination to name-value combinations with other names


In R, I have a data-frame with unique name-value combinations:

data <- data.frame(
  name = c('a', 'a', 'b', rep('c', 3)),
  value = c('d1', 'd12', 'd123', 'b1', 'c12', 'd1234')
)

> data
  name value
1    a    d1
2    a   d12
3    b  d123
4    c    b1
5    c   c12
6    c d1234

I need a data.frame, where each unique name-value combination is compared to name-value combinations with other names, i.e., all combinations with name 'a' should be compared to all combinations with name 'b' or 'c', but 'a-d1' should not be compared to 'a-d12'. "Compared to" means that they are on the same row:

data2 <- data.frame(
  name1 = c(rep('a', 8), rep('b', 3)),
  name2 = c(rep(c('b', 'c', 'c', 'c'), 2), rep('c', 3)),
  value1 = c(rep('d1', 4), rep('d12', 4), rep('d123', 3)),
  value2 = c(rep(c('d123', 'b1', 'c12', 'd1234'), 2), 'b1', 'c12', 'd1234')
)

> data2
   name1 name2 value1 value2
1      a     b     d1   d123
2      a     c     d1     b1
3      a     c     d1    c12
4      a     c     d1  d1234
5      a     b    d12   d123
6      a     c    d12     b1
7      a     c    d12    c12
8      a     c    d12  d1234
9      b     c   d123     b1
10     b     c   d123    c12
11     b     c   d123  d1234

My real data has 19.000 unique names and 40.000 unique values which form 40.000 unique name-value combinations (each value has only 1 name), so speed is key. Percentiles of number of values per name looks like this (i.e., 99% names have 1-8 values)

25% 50% 75% 90% 99% 
  1   2   3   4   8 

Solution

  • 1) Perform a complex self join as shown. Add the dbname=tempfile() sqldf argument if you need to perform it out of memory.

    library(sqldf)
    
    sqldf("select a.name name1, b.name name2, a.value value1, b.value value2
      from data a join data b
      on a.name < b.name")
    

    giving

       name1 name2 value1 value2
    1      a     b     d1   d123
    2      a     c     d1     b1
    3      a     c     d1    c12
    4      a     c     d1  d1234
    5      a     b    d12   d123
    6      a     c    d12     b1
    7      a     c    d12    c12
    8      a     c    d12  d1234
    9      b     c   d123     b1
    10     b     c   d123    c12
    11     b     c   d123  d1234
    

    2) dplyr or with dplyr

    library(dplyr)
    data %>%
      inner_join(data, join_by(name < name))
    

    3) Base R Use combn with only base R:

    combn(nrow(data), 2, function(ix) { 
      i <- ix[1]
      j <- ix[2]
      c(data$name[i], data$name[j], data$value[i], data$value[j])
    }) |>
    t() |>
    as.data.frame() |>
    setNames(c("name1", "name2", "value1", "value2")) |>
    subset(name1 != name2)