Obtain one to one matches from many-to-many linkages, using data.table R

I am trying to link two sets of anonymous ID (ID1 and ID3), via a lookup to another ID (ID2). I want to keep only one-to-one matches between ID1 and ID3, but this is tricky as there are many-to-many matches between each ID.

library(data.table)
Table1 <- data.table(ID1 = c(1, 1, 2, 3, 4, 5, 5, 6),
                 ID2 = c(101, 102, 102, 103, 104, 105, 106, 107))

Table2 <- data.table(ID2 = c(101, 102, 103, 103, 104, 105, 106, 108),
                 ID3 = c(201, 202, 203, 204, 205, 206, 206, 207))

I have tried joining the tables. This is successful in getting rid of cases where ID2 only occurs in one of the tables.

setkey(Table1,ID2)
setkey(Table2,ID2)
merged_table <- Table2[Table1,nomatch=0]

However, it still contains cases where ID1 is linking to multiple ID3 values, and vice versa.

I want to end up with a table like this:

ID1    ID3
  4    205
  5    206

Ideally I would like to do this using data.table in R.

Solution

Vectorize the filtering on the joined table:

unique(
  Table2[
    Table1, .(ID1, ID3), on = "ID2", nomatch=0,
    allow.cartesian=TRUE
  ]
)[!(duplicated(ID1) | duplicated(ID1, fromLast = TRUE) |
      duplicated(ID3) | duplicated(ID3, fromLast = TRUE))]
#>    ID1 ID3
#> 1:   4 205
#> 2:   5 206

Timings

Implement a few of the answers so far as functions:

f1 <- function(Table1, Table2) {
  unique(
    Table2[
      Table1, .(ID1, ID3), on = "ID2", nomatch=0,
      allow.cartesian=TRUE
    ]
  )[!(duplicated(ID1) | duplicated(ID1, fromLast = TRUE) |
        duplicated(ID3) | duplicated(ID3, fromLast = TRUE))]
}

f2 <- function(Table1, Table2) {
  # adapted from rw2
  setkey(Table1, ID2)
  setkey(Table2, ID2)
  combined_lookups <- Table2[Table1, nomatch=0, allow.cartesian=TRUE]
  
  # Count the number of matches in both directions
  combined_lookups[, count1 := uniqueN(ID1), by = ID3]
  combined_lookups[, count2 := uniqueN(ID3), by = ID1]
  
  # Then remove cases where ID's match multiple person_ids, and vice versa. Also remove duplicate rows.
  unique(combined_lookups[count1==1 & count2==1][,.(ID1,ID3)])
}

f3 <- function(Table1, Table2) {
  # adapted from @s_baldur
  foo <- \(x) x %in% names(which(table(x) == 1L))
  
  Table2[Table1, on = "ID2", nomatch=0, .(ID1, ID3), allow.cartesian=TRUE
  ][, unique(.SD)
  ][foo(ID1) & foo(ID3)] 
}

Create a large dataset:

Table1 <- data.table(ID1 = cumsum(runif(1e6)%/%0.5),
                     ID2 = cumsum(c(1e6, runif(999999)%/%0.5)))
Table2 <- data.table(ID2 = cumsum(c(1e6, runif(999999)%/%0.5)),
                     ID3 = cumsum(c(2e6, runif(999999)%/%0.5)))

Time the functions on the larger dataset:

system.time(dt1 <- f1(Table1, Table2))
#>    user  system elapsed 
#>    0.70    0.01    0.49
system.time(dt2 <- f2(Table1, Table2))
#>    user  system elapsed 
#>   14.89    3.58   18.05
system.time(dt3 <- f3(Table1, Table2))
#>    user  system elapsed 
#>    5.69    0.08    5.36

Check for equivalence:

identical(dt1, dt2)
#> [1] TRUE
identical(dt1, dt3)
#> [1] TRUE