I am having trouble understanding the result of the agrep() function, and I can't see what I have missed in its description.
agrep() does fuzzy matching, and I'd like to use it to correct misspellings, allowing at most 2 insertions / deletions / substitutions.
Here is my code just for an example:
check=c("73SAINTGERMAINLACHAMBOTTE","73CHAMBERY")
agrep("73SAINTGERVAIS",check,ignore.case=TRUE,max.distance=2,value=TRUE)
Here, I expect no result, because "73SAINTGERVAIS" cannot be transformed into "73SAINTGERMAINLACHAMBOTTE" or "73CHAMBERY" with at most 2 insertions / deletions / substitutions.
However, the result is:
[1] "73SAINTGERMAINLACHAMBOTTE"
Does this mean that insertions / deletions / substitutions are not counted per character (that is, is the string "MAINLACHAMBOTTE" counted as 1 insertion)?
That is because agrep() also does partial (substring) matching. For example, '73SAINTGERVAIS' is at distance two from the substring '73SAINTGERMAIN' (two substitutions: V to M and S to N).
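To see the difference between the two criteria, here is a small sketch that compares the whole-string distance with the best-substring distance for the strings from the question. It relies on the partial argument of adist(), which as far as I know corresponds to the kind of matching agrep() does by default; the variable names are only for illustration.

```r
# Same pattern and candidates as in the question
pattern <- "73SAINTGERVAIS"
check <- c("73SAINTGERMAINLACHAMBOTTE", "73CHAMBERY")

# Whole-string edit distance: every character of the candidate counts
full <- adist(pattern, check)

# Best-substring edit distance: the pattern only has to match
# some part of the candidate, not all of it
partial <- adist(pattern, check, partial = TRUE)

full <= 2     # whole-string test
partial <= 2  # substring test, the criterion agrep() applies
```

The first candidate only passes the substring test, which is why agrep() returns it.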
You may want to try adist() instead, like this:
check=c("73SAINTGERMAINLACHAMBOTTE","73CHAMBERY", "73SAINTGERMAIN")
adist("73SAINTGERVAIS",check) <= 2
      [,1]  [,2] [,3]
[1,] FALSE FALSE TRUE
If you want the vector of matched input strings as output, you can further do the following:
check[as.logical(adist("73SAINTGERVAIS",check) <= 2)]
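If you need this more than once, the same idea can be wrapped in a small helper. This is only a sketch: fuzzy_whole_match and its argument names are made up here, and it assumes you want case-insensitive comparison as in the original agrep() call.

```r
# Hypothetical helper: return the candidates whose whole-string edit
# distance to `pattern` is at most `max_edits`
fuzzy_whole_match <- function(pattern, candidates, max_edits = 2,
                              ignore_case = TRUE) {
  # adist() returns a 1 x length(candidates) matrix for a single pattern
  d <- adist(pattern, candidates, ignore.case = ignore_case)
  candidates[d[1, ] <= max_edits]
}

check <- c("73SAINTGERMAINLACHAMBOTTE", "73CHAMBERY", "73SAINTGERMAIN")
result <- fuzzy_whole_match("73SAINTGERVAIS", check)
result
```

With the three candidates above, only "73SAINTGERMAIN" is within 2 edits of the pattern.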