
R agrep() function behaviour


I am having trouble understanding the result of the agrep() function, and I can't see what I have missed in its documentation. agrep() is for fuzzy matching, and I'd like to use it to correct misspellings. I want to allow at most 2 insertions / deletions / substitutions.

Here is a minimal example:

check=c("73SAINTGERMAINLACHAMBOTTE","73CHAMBERY")
agrep("73SAINTGERVAIS",check,ignore.case=TRUE,max.distance=2,value=TRUE)

I expected no match here, because "73SAINTGERVAIS" cannot be transformed into "73SAINTGERMAINLACHAMBOTTE" or "73CHAMBERY" with at most 2 insertions / deletions / substitutions. However, the result is:

[1] "73SAINTGERMAINLACHAMBOTTE"

Does this mean that insertions / deletions / substitutions are not counted per character (that is, the whole string "MAINLACHAMBOTTE" is treated as 1 insertion)?


Solution

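  • That is because agrep() also does partial matching: it reports a match when the pattern is within max.distance of some substring of the target. Here, '73SAINTGERVAIS' is at edit distance 2 from the substring '73SAINTGERMAIN' (two substitutions, V to M and S to N).

    To compare whole strings instead, you may want to try adist(), like this: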

    check=c("73SAINTGERMAINLACHAMBOTTE","73CHAMBERY", "73SAINTGERMAIN")
    adist("73SAINTGERVAIS",check) <= 2
          [,1]  [,2] [,3]
    [1,] FALSE FALSE TRUE
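
    To see the two notions of distance side by side, here is a small sketch using the partial argument of adist(). According to its documentation, partial = TRUE gives the substring-based distance that agrep() uses by default. The names pattern, d_full and d_partial are only illustrative.

    ```r
    check <- c("73SAINTGERMAINLACHAMBOTTE", "73CHAMBERY")
    pattern <- "73SAINTGERVAIS"

    # Whole-string edit distance: every extra character in the target counts.
    d_full <- adist(pattern, check)

    # Partial distance: the pattern only has to match some substring of each
    # target, which is the behaviour agrep() applies by default.
    d_partial <- adist(pattern, check, partial = TRUE)

    d_full <= 2     # whole-string criterion
    d_partial <= 2  # substring criterion, comparable to agrep(max.distance = 2)
    ```

    Comparing the two logical matrices should show that the first element of check only passes the substring criterion, which is why agrep() returned it.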
    

    If you want the vector of matched input strings as output, you can further do the following:

    check[as.logical(adist("73SAINTGERVAIS",check) <= 2)]
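
    If you need this in several places, the filtering step could be wrapped in a small function. The sketch below is only one possible way to do it: match_whole, max_edits and ignore_case are hypothetical names, not part of base R, and ignore_case is simply passed to adist()'s ignore.case argument to mirror the ignore.case=TRUE used in the question.

    ```r
    # Hypothetical helper (not part of base R): keep only the candidates whose
    # whole-string edit distance to `pattern` is at most `max_edits`.
    match_whole <- function(pattern, candidates, max_edits = 2, ignore_case = TRUE) {
      # adist() returns a 1 x length(candidates) matrix for a single pattern
      d <- adist(pattern, candidates, ignore.case = ignore_case)
      candidates[as.vector(d) <= max_edits]
    }

    check <- c("73SAINTGERMAINLACHAMBOTTE", "73CHAMBERY", "73SAINTGERMAIN")
    hits <- match_whole("73SAINTGERVAIS", check)
    hits
    ```

    Note that this assumes a single pattern; for a vector of patterns, adist() returns one row per pattern and the indexing would need to be done row by row.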