r agrep approximate

What is the logic of approximate string matching?


Does anybody know the reason for the following behaviour?

agrepl("cold", "cool")
#> [1] FALSE
agrepl("cool", "cold")
#> [1] TRUE

Solution

  • The reason is the default for max.distance, which the documentation describes as:

    If cost is not given, all defaults to 10%, and the other transformation number bounds default to all. The component names can be abbreviated.

    And:

    Expressed either as integer, or as a fraction of the pattern length times the maximal transformation cost (will be replaced by the smallest integer not less than the corresponding fraction)

    The default maximum number of transformations for a pattern of length 4 is therefore 1 (10% of 4 is 0.4, rounded up to the next integer). The pattern cool matches the col at the beginning of cold using only one deletion. Changing the pattern cold to match cool would take at least two transformations (two substitutions, or one deletion and one insertion).

    These examples might explain it a bit further:

    agrepl("cold", "cool", max.distance = 1) # two changes necessary
    #> [1] FALSE
    agrepl("cold", "cool", max.distance = 2)
    #> [1] TRUE
    agrepl("cold", "coold") # just one insertion necessary
    #> [1] TRUE
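
To check the arithmetic yourself, here is a minimal sketch. It assumes the default fraction of 0.1 quoted above and unit costs for every transformation. The helper pattern_bound is a made-up name for illustration, not part of R. adist() with partial = TRUE is documented as using the same substring distance that agrep uses by default, so the two distances it reports should line up with the TRUE/FALSE results above.

```r
# Hypothetical helper (not part of base R): the default bound described above,
# i.e. the smallest integer not less than fraction * pattern length.
pattern_bound <- function(pattern, fraction = 0.1) {
  ceiling(fraction * nchar(pattern))
}

pattern_bound("cool")  # 0.1 * 4 = 0.4, rounded up
#> [1] 1

# Distance from the pattern (first argument) to its closest substring
# of the target (second argument).
d_cool_in_cold <- adist("cool", "cold", partial = TRUE)[1, 1]
d_cold_in_cool <- adist("cold", "cool", partial = TRUE)[1, 1]

# A match is reported when the distance does not exceed the bound.
d_cool_in_cold <= pattern_bound("cool")
d_cold_in_cool <= pattern_bound("cold")
```

If the first comparison is TRUE and the second is FALSE, that is the same asymmetry as in the question: the pattern only has to match some substring of the target, so swapping the arguments changes the distance.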