I am trying to use the RecordLinkage package to link two datasets, where one dataset tends to give multiple last/middle names and the other gives just a single last name. The string comparison function currently being used is Jaro-Winkler; however, the score it returns depends on how the strings happen to line up positionally rather than on whether the content of the shorter string appears anywhere in the longer string. This is leading to many poor-quality links. A reproducible example of the problematic weights is as follows:
library(RecordLinkage)
data1 <- as.data.frame(list("lname" = c("lolli gaggen nazeem", "lolli gaggen nazeem", "lolli gaggen nazeem"),
"bday" = c("1908-08-08", "1979-12-12", "1560-06-06") ) )
data2 <- as.data.frame(list("lname" = c("lolli", "gaggen", "nazeem"),
"bday" = c("1908-08-08", "1979-12-12", "1560-06-06") ) )
blocking_variable <- c("bday")
pass <- compare.linkage(data1, data2, blockfld = blocking_variable, strcmp = TRUE)
pass_weights <- epiWeights(pass)
getPairs(pass_weights, single.rows = TRUE)
  id1              lname.1     bday.1 id2 lname.2     bday.2    Weight
1   1 lolli gaggen nazheem 1908-08-08   1   lolli 1908-08-08 0.9162463
2   2 lolli gaggen nazheem 1979-12-12   2  gaggen 1979-12-12 0.8697165
3   3 lolli gaggen nazheem 1560-06-06   3 nazheem 1560-06-06 0.6995502
I want ids 2 and 3 to receive roughly the same weight as id 1; however, they currently score much lower because the last names are not in the same position in both datasets (although the content agrees). Is there a way to modify the string comparison function used here, or the structure of the data, so that I can account for the different orderings?
Additional Notes:
Both datasets have millions of rows so memory efficiency is definitely important here!
Sometimes the other dataset may have more than a single last name, so we'd be comparing 3 words against 2 words. It would probably be best to start with the easy case first, though.
Have you thought about the following approach?
Record linkage on names is, as I'm sure you know, difficult. Ideally you want to block on other available information (gender, unique identifiers, date of birth, location, etc.) and then do string comparisons on the names.
You mention large datasets with millions of records. Look no further than the data.table package by the great Matt Dowle (https://stackoverflow.com/users/403310/matt-dowle).
The RecordLinkage package is slow in comparison. You could easily extend the main example in this answer with phonetic encoding techniques such as soundex, double metaphone, NYSIIS, etc.
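As a rough illustration of the phonetic idea, separate from the main example that follows, here is a minimal sketch using the `soundex()` function that ships with RecordLinkage. The helper name `shares_soundex_token` is hypothetical, and using a shared Soundex code as a cheap pre-filter is my assumption about how you might apply it, not something the package does for you.

```r
library(RecordLinkage)

# Hypothetical helper: TRUE when the two names share at least one token
# (split on space or hyphen) with the same Soundex code.
shares_soundex_token <- function(a, b, split = " |-") {
  codes_a <- soundex(strsplit(a, split)[[1]])
  codes_b <- soundex(strsplit(b, split)[[1]])
  length(intersect(codes_a, codes_b)) > 0
}

# Misspelled token still shares a code with the single last name
shares_soundex_token("lollly gaggen nazeem", "lolli")

# Unrelated names share no code
shares_soundex_token("matt dowle", "john smith")
```

A filter like this could be used to discard candidate pairs before the more expensive Jaro-Winkler step, at the cost of missing pairs whose tokens differ in their first letter.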
# install.packages("data.table")
library(RecordLinkage)
library(data.table)
# stringsAsFactors = FALSE keeps lname as character on R < 4.0, which strsplit() needs
data1 <- as.data.frame(list("lname" = c("lolli gaggen nazeeem", "lolli gaggen nazeem", "lollly gaggen nazeem", "matt dowle", "john-smith"),
"bday" = c("1908-08-08", "1979-12-12", "1560-06-06", "1979-12-12", "1560-06-06") ), stringsAsFactors = FALSE)
data2 <- as.data.frame(list("lname" = c("lolli", "gaggen", "nazeem", "m dowl", "johnny smith"),
"bday" = c("1908-08-08", "1979-12-12", "1560-06-06", "1979-12-12", "1560-06-06") ), stringsAsFactors = FALSE)
# Coerce to data.tables
setDT(data1)
setDT(data2)
# Define a regex split (we will split all words based on space or hyphen)
split <- " |-"
# Apply a blocking strategy based on bday. Ideally your dataset would allow for additional blocking strategies(?).
block_pairs <- merge(data1, data2, by = "bday", all = TRUE,
                     sort = TRUE, suffixes = c(".x", ".y"))
# Store the split up components of each comparison variable.
split1 <- strsplit(block_pairs[["lname.x"]], split)
split2 <- strsplit(block_pairs[["lname.y"]], split)
# Full comparison: Jaro-Winkler on the complete strings
fc <- jarowinkler(block_pairs[["lname.x"]], block_pairs[["lname.y"]])
# Partial comparison: Jaro-Winkler on every combination of components, keeping the best score per pair
pc <- mapply(function(x, y) max(outer(x, y, jarowinkler)), split1, split2)
# Store the element-wise max of the full and partial comparisons
block_pairs[, winkler.lname := pmax(fc, pc)]
# Sort by the Jaro-Winkler score (ascending, so the strongest matches come last)
block_pairs <- block_pairs[order(winkler.lname)]
# Inspect
block_pairs
# 0.96 is an appropriate threshold in this instance
block_pairs <- block_pairs[winkler.lname >= 0.96]
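For the harder case mentioned in the question (e.g. 3 words against 2), taking the single best component score can be too generous, since one matching word is enough to score highly. One possible extension, and only a sketch, is to require every word of the shorter name to find a good partner in the longer name and average those best scores. The helper name `token_score` and the choice of averaging are my assumptions, not part of RecordLinkage.

```r
library(RecordLinkage)

# Hypothetical helper: for each token of the shorter name, take its best
# Jaro-Winkler score against any token of the longer name, then average.
token_score <- function(a, b, split = " |-") {
  if (is.na(a) || is.na(b)) return(NA_real_)
  ta <- strsplit(a, split)[[1]]
  tb <- strsplit(b, split)[[1]]
  if (length(ta) == 0 || length(tb) == 0) return(NA_real_)
  # Make ta the name with fewer tokens
  if (length(ta) > length(tb)) {
    tmp <- ta; ta <- tb; tb <- tmp
  }
  sims <- outer(ta, tb, jarowinkler)  # rows: tokens of the shorter name
  mean(apply(sims, 1, max))           # best partner per row, averaged
}

# Word order does not matter: both tokens have an exact partner
token_score("gaggen lolli", "lolli gaggen nazeem")

# Applied row-wise to a blocked table of candidate pairs, for example:
# block_pairs[, token.lname := mapply(token_score, lname.x, lname.y)]
```

Because the score is averaged over the shorter name only, extra middle names in the longer record do not pull the score down, while a non-matching word in the shorter record does.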