rstring-matchingfuzzy-searchfuzzy-comparison

How can I fuzzy match strings from two datasets?


I've been working on a way to join two datasets based on a imperfect string, such as a name of a company. In the past I had to match two very dirty lists, one list had names and financial information, another list had names and address. Neither had unique IDs to match on! ASSUME THAT CLEANING HAS ALREADY BEEN APPLIED AND THERE MAYBE TYPOS AND INSERTIONS.

So far AGREP is the closest tool I've found that might work. I can use levenshtein distances in the AGREP package, which measure the number of deletions, insertions and substitutions between two strings. AGREP will return the string with the smallest distance (the most similar).

However, I've been having trouble turning this command from a single value to apply it to an entire data frame. I've crudely used a for loop to repeat the AGREP function, but there's gotta be an easier way.

See the following code:

a<-data.frame(name=c('Ace Co','Bayes', 'asd', 'Bcy', 'Baes', 'Bays'),price=c(10,13,2,1,15,1))
b<-data.frame(name=c('Ace Co.','Bayes Inc.','asdf'),qty=c(9,99,10))

for (i in 1:6){
    a$x[i] = agrep(a$name[i], b$name, value = TRUE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
    a$Y[i] = agrep(a$name[i], b$name, value = FALSE, max = list(del = 0.2, ins = 0.3, sub = 0.4))
}

Solution

  • The solution depends on the desired cardinality of your matching a to b. If it's one-to-one, you will get the three closest matches above. If it's many-to-one, you will get six.

    One-to-one case (requires assignment algorithm):

    When I've had to do this before I treat it as an assignment problem with a distance matrix and an assignment heuristic (greedy assignment used below). If you want an "optimal" solution you'd be better off with optim.

    Not familiar with AGREP but here's example using stringdist for your distance matrix.

    library(stringdist)
    d <- expand.grid(a$name,b$name) # Distance matrix in long form
    names(d) <- c("a_name","b_name")
    d$dist <- stringdist(d$a_name,d$b_name, method="jw") # String edit distance (use your favorite function here)
    
    # Greedy assignment heuristic (Your favorite heuristic here)
    greedyAssign <- function(a,b,d){
      x <- numeric(length(a)) # assgn variable: 0 for unassigned but assignable, 
      # 1 for already assigned, -1 for unassigned and unassignable
      while(any(x==0)){
        min_d <- min(d[x==0]) # identify closest pair, arbitrarily selecting 1st if multiple pairs
        a_sel <- a[d==min_d & x==0][1] 
        b_sel <- b[d==min_d & a == a_sel & x==0][1] 
        x[a==a_sel & b == b_sel] <- 1
        x[x==0 & (a==a_sel|b==b_sel)] <- -1
      }
      cbind(a=a[x==1],b=b[x==1],d=d[x==1])
    }
    data.frame(greedyAssign(as.character(d$a_name),as.character(d$b_name),d$dist))
    

    Produces the assignment:

           a          b       d
    1 Ace Co    Ace Co. 0.04762
    2  Bayes Bayes Inc. 0.16667
    3    asd       asdf 0.08333
    

    I'm sure there's a much more elegant way to do the greedy assignment heuristic, but the above works for me.

    Many-to-one case (not an assignment problem):

    do.call(rbind, unname(by(d, d$a_name, function(x) x[x$dist == min(x$dist),])))
    

    Produces the result:

       a_name     b_name    dist
    1  Ace Co    Ace Co. 0.04762
    11   Baes Bayes Inc. 0.20000
    8   Bayes Bayes Inc. 0.16667
    12   Bays Bayes Inc. 0.20000
    10    Bcy Bayes Inc. 0.37778
    15    asd       asdf 0.08333
    

    Edit: use method="jw" to produce desired results. See help("stringdist-package")