r agrep

Why does agrep in R not find the best match?


I am attempting string matching in R using the agrep function. However, I am concerned that it stops as soon as it finds a good match rather than searching for the best one, though it is possible that I have misunderstood how it works. The example below reproduces the problem, albeit crudely.

example1 <- c("height", "weight")
example2 <- c("height", "weight")

y <- c("", "")
for (i in 1:2) {
  x <- agrep(example1[i], example2, max.distance = 1, ignore.case = TRUE,
             value = TRUE, useBytes = TRUE)
  x <- paste0(x, "")
  y[i] <- x
}

After the loop, y[2] is "height": agrep has matched "weight" to "height", even though "weight" itself is present in example2 and is the better (exact) match.

Why is this?


Solution

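  • agrep does not rank candidates: it returns every element within max.distance, in the order they appear in the input. Here both "height" and "weight" are within one edit of "weight", so x has length 2 and y[i] <- x keeps only its first element, "height" (with a warning). To pick the closest candidate instead, you can try adist, which computes the generalized Levenshtein (edit) distance. Each element of example1 is then matched to its closest element of example2: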

    adist(example1, example2)
         [,1] [,2]
    [1,]    0    1
    [2,]    1    0
    
    example2[apply(adist(example1, example2), 1, which.min)]
    # [1] "height" "weight"
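
If you also want to keep the max.distance cutoff from the original agrep call, the two ideas can be combined. The following is only a minimal sketch: best_match and its max_dist argument are hypothetical names introduced for illustration, and it assumes the inputs contain no NA values. It first shows that agrep returns every candidate within the cutoff, then picks the closest candidate per pattern with adist.

```r
example1 <- c("height", "weight")
example2 <- c("height", "weight")

# agrep returns every candidate within max.distance, in input order,
# so "weight" matches both elements of example2.
agrep("weight", example2, max.distance = 1, value = TRUE)
# [1] "height" "weight"

# Hypothetical helper (not part of base R): for each pattern, return the
# candidate with the smallest edit distance, or NA if even the closest
# candidate is more than max_dist edits away.
best_match <- function(patterns, candidates, max_dist = 1) {
  d <- adist(patterns, candidates, ignore.case = TRUE)
  idx <- apply(d, 1, which.min)                # closest candidate per row
  best <- d[cbind(seq_along(patterns), idx)]   # distance to that candidate
  ifelse(best <= max_dist, candidates[idx], NA_character_)
}

best_match(example1, example2)
# [1] "height" "weight"
```

Note that which.min breaks ties by taking the first minimum, so if two candidates are equally close the one that appears earlier in candidates is returned.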