I am trying to link two sets of anonymous ID (ID1 and ID3), via a lookup to another ID (ID2). I want to keep only one-to-one matches between ID1 and ID3, but this is tricky as there are many-to-many matches between each ID.
library(data.table)
Table1 <- data.table(ID1 = c(1, 1, 2, 3, 4, 5, 5, 6),
ID2 = c(101, 102, 102, 103, 104, 105, 106, 107))
Table2 <- data.table(ID2 = c(101, 102, 103, 103, 104, 105, 106, 108),
ID3 = c(201, 202, 203, 204, 205, 206, 206, 207))
I have tried joining the tables. This is successful in getting rid of cases where ID2 only occurs in one of the tables.
setkey(Table1,ID2)
setkey(Table2,ID2)
merged_table <- Table2[Table1,nomatch=0]
However, it still contains cases where ID1 is linking to multiple ID3 values, and vice versa.
I want to end up with a table like this:
ID1 ID3
4 205
5 206
Ideally I would like to do this using data.table
in R.
Vectorize the filtering on the joined table:
unique(
Table2[
Table1, .(ID1, ID3), on = "ID2", nomatch=0,
allow.cartesian=TRUE
]
)[!(duplicated(ID1) | duplicated(ID1, fromLast = TRUE) |
duplicated(ID3) | duplicated(ID3, fromLast = TRUE))]
#> ID1 ID3
#> 1: 4 205
#> 2: 5 206
Implement a few of the answers so far as functions:
f1 <- function(Table1, Table2) {
unique(
Table2[
Table1, .(ID1, ID3), on = "ID2", nomatch=0,
allow.cartesian=TRUE
]
)[!(duplicated(ID1) | duplicated(ID1, fromLast = TRUE) |
duplicated(ID3) | duplicated(ID3, fromLast = TRUE))]
}
f2 <- function(Table1, Table2) {
# adapted from rw2
setkey(Table1, ID2)
setkey(Table2, ID2)
combined_lookups <- Table2[Table1, nomatch=0, allow.cartesian=TRUE]
# Count the number of matches in both directions
combined_lookups[, count1 := uniqueN(ID1), by = ID3]
combined_lookups[, count2 := uniqueN(ID3), by = ID1]
# Then remove cases where ID's match multiple person_ids, and vice versa. Also remove duplicate rows.
unique(combined_lookups[count1==1 & count2==1][,.(ID1,ID3)])
}
f3 <- function(Table1, Table2) {
# adapted from @s_baldur
foo <- \(x) x %in% names(which(table(x) == 1L))
Table2[Table1, on = "ID2", nomatch=0, .(ID1, ID3), allow.cartesian=TRUE
][, unique(.SD)
][foo(ID1) & foo(ID3)]
}
Create a large dataset:
Table1 <- data.table(ID1 = cumsum(runif(1e6)%/%0.5),
ID2 = cumsum(c(1e6, runif(999999)%/%0.5)))
Table2 <- data.table(ID2 = cumsum(c(1e6, runif(999999)%/%0.5)),
ID3 = cumsum(c(2e6, runif(999999)%/%0.5)))
Time the functions on the larger dataset:
system.time(dt1 <- f1(Table1, Table2))
#> user system elapsed
#> 0.70 0.01 0.49
system.time(dt2 <- f2(Table1, Table2))
#> user system elapsed
#> 14.89 3.58 18.05
system.time(dt3 <- f3(Table1, Table2))
#> user system elapsed
#> 5.69 0.08 5.36
Check for equivalence:
identical(dt1, dt2)
#> [1] TRUE
identical(dt1, dt3)
#> [1] TRUE