r string-comparison fuzzy-comparison record-linkage jaro-winkler

Compare and link strings with different word orders / word counts


I am trying to use the RecordLinkage package to link two datasets, where one dataset tends to give multiple last / middle names and the other gives just a single last name. The string comparison function currently being used is Jaro-Winkler; however, the score it returns depends on how the strings happen to line up rather than on whether the content of the shorter string appears anywhere in the longer string. This leads to many poor-quality links being created. A reproducible example of the wrong weightings is as follows:

library(RecordLinkage)
data1 <- as.data.frame(list("lname" = c("lolli gaggen nazheem", "lolli gaggen nazheem", "lolli gaggen nazheem"),
                           "bday" = c("1908-08-08", "1979-12-12", "1560-06-06") ) )

data2 <- as.data.frame(list("lname" = c("lolli", "gaggen", "nazheem"),
                           "bday" = c("1908-08-08", "1979-12-12", "1560-06-06") ) )

blocking_variable <- c("bday")
pass <- compare.linkage(data1, data2, blockfld = blocking_variable, strcmp = TRUE)
pass_weights <- epiWeights(pass)
getPairs(pass_weights, single.rows = TRUE)

  id1              lname.1     bday.1 id2 lname.2     bday.2    Weight
1   1 lolli gaggen nazheem 1908-08-08   1   lolli 1908-08-08 0.9162463
2   2 lolli gaggen nazheem 1979-12-12   2  gaggen 1979-12-12 0.8697165
3   3 lolli gaggen nazheem 1560-06-06   3 nazheem 1560-06-06 0.6995502

I want ids 2 and 3 to receive roughly the same weight as id 1; however, they currently score much lower because their last names are not in the same position in both datasets (even though the content agrees). Is there a way to modify the string comparison function used here, or the structure of the data, so that different word orderings are taken into account?

Additional Notes:


Solution

  • Have you thought about the following approach?

    As you no doubt know, record linkage on names is difficult. Ideally you want to block on other available information (gender, unique identifiers, date of birth, location, etc.) and then do string comparisons on the names.

    You mention large datasets with millions of records. Look no further than the data.table package by the great Matt Dowle (https://stackoverflow.com/users/403310/matt-dowle).

    The RecordLinkage package is slow in comparison. You could easily extend the code below with phonetic encoding (string hashing) techniques such as Soundex, Double Metaphone, NYSIIS, etc.

    # install.packages("data.table")
    library(RecordLinkage)
    library(data.table)
    
    # stringsAsFactors = FALSE keeps the names as character vectors, which
    # strsplit() and jarowinkler() require (the default before R 4.0 was factors)
    data1 <- data.frame(lname = c("lolli gaggen nazeeem", "lolli gaggen nazeem", "lollly gaggen nazeem", "matt dowle", "john-smith"),
                        bday = c("1908-08-08", "1979-12-12", "1560-06-06", "1979-12-12", "1560-06-06"),
                        stringsAsFactors = FALSE)
    
    data2 <- data.frame(lname = c("lolli", "gaggen", "nazeem", "m dowl", "johnny smith"),
                        bday = c("1908-08-08", "1979-12-12", "1560-06-06", "1979-12-12", "1560-06-06"),
                        stringsAsFactors = FALSE)
    
    
    # Coerce to data.tables
    setDT(data1)
    setDT(data2)
    
    # Define a regex split pattern (we will split all words on a space or hyphen).
    # Named split_pattern so it does not shadow base::split().
    split_pattern <- " |-"
    
    # Apply a blocking strategy based on bday. Ideally your dataset would allow for additional blocking strategies(?).
    block_pairs <- merge(data1, data2, by = "bday", all = TRUE,
                sort = TRUE, suffixes = c(".x", ".y"))
    
    # Store the split up components of each comparison variable.
    split1 <- strsplit(block_pairs[["lname.x"]], split_pattern)
    split2 <- strsplit(block_pairs[["lname.y"]], split_pattern)
    
    # Full comparison: Jaro-Winkler on the complete strings
    fc <- jarowinkler(block_pairs[["lname.x"]], block_pairs[["lname.y"]])
    # Partial comparison: best Jaro-Winkler score over every pair of words from the two strings
    pc <- mapply(function(x, y) max(outer(x, y, jarowinkler)), split1, split2)
    
    # Store the element-wise max of the full and partial comparisons
    block_pairs[, winkler.lname := pmax(fc, pc)]
    
    
    # Sort by the jarowinkler score
    block_pairs <- block_pairs[order(winkler.lname)]
    
    # Inspect
    block_pairs
    
    # 0.96 is an appropriate threshold in this instance
    block_pairs <- block_pairs[winkler.lname >= 0.96]
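
    The word-by-word comparison above can also be wrapped in a small helper and tried on the three pairs from the question. This is only a sketch: best_token_jw and its arguments (pattern, encode) are hypothetical names, not part of RecordLinkage. The optional encode step assumes that RecordLinkage's soundex() is a reasonable stand-in for the phonetic encoding idea mentioned earlier; Double Metaphone or NYSIIS would need another package.

    ```r
    library(RecordLinkage)

    # Hypothetical helper (illustrative name, not a RecordLinkage function):
    # split both names into words, optionally encode each word, and return
    # the best Jaro-Winkler score over all word pairs.
    best_token_jw <- function(a, b, pattern = " |-", encode = identity) {
      tokens_a <- lapply(strsplit(a, pattern), encode)
      tokens_b <- lapply(strsplit(b, pattern), encode)
      mapply(function(x, y) max(outer(x, y, jarowinkler)),
             tokens_a, tokens_b, USE.NAMES = FALSE)
    }

    # The three pairs from the question: the short name sits at a different
    # position inside the long name each time.
    long_names  <- rep("lolli gaggen nazheem", 3)
    short_names <- c("lolli", "gaggen", "nazheem")
    order_free  <- best_token_jw(long_names, short_names)
    print(order_free)  # every pair shares one identical word

    # Same comparison on Soundex codes (assumption: soundex() from
    # RecordLinkage), so variants that sound alike, such as "dowle" and
    # "dowl", are compared by their phonetic code rather than their spelling.
    phonetic <- best_token_jw("matt dowle", "m dowl", encode = soundex)
    print(phonetic)
    ```

    Because each short name matches one word of the long name exactly, all three pairs should get the same top score regardless of word position, which is the behaviour the question asks for. Taking the max over word pairs is deliberately lenient: a single shared word is enough for a high score, so it is best combined with strict blocking.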