Tags: r, string, levenshtein-distance, stringdist

Stringdist distance unexpectedly large


The following data has the surprising result that it does not match. I was expecting the distance to be 5, but even at 7 I get no match.

library(fuzzyjoin)
one <- as.data.frame("Other field crops (non-organic)")
names(one) <- "A"
two <- as.data.frame("other_field_crops_non_organic")
names(two) <- "A"

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 7, ignore_case = TRUE)

                              A.x  A.y
1 Other field crops (non-organic) <NA>

Only at 10 do I get a match:

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 10, ignore_case = TRUE)
                              A.x                           A.y
1 Other field crops (non-organic) other_field_crops_non_organic

Could someone explain to me why this distance is larger than 9? Does it have to do with the brackets? And if so, how can I circumvent this issue without removing the brackets?

EDIT

library(fuzzyjoin)
one <- as.data.frame("Other field crops non-organic")
names(one) <- "A"
two <- as.data.frame("other_field_crops_non_organic")
names(two) <- "A"

stringdist_left_join(one, two, by = "A", method = "lcs", max_dist = 5, ignore_case = TRUE)
                            A.x  A.y
1 Other field crops non-organic <NA>

Even without the brackets I cannot get the distance within 5.
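
For reference, the underlying distance can be checked directly with the stringdist package (a quick sketch; the strings are lowercased to mimic ignore_case = TRUE):

library(stringdist)
stringdist(tolower("Other field crops non-organic"),
           "other_field_crops_non_organic",
           method = "lcs")
# [1] 8  (4 underscores removed, plus 3 spaces and 1 hyphen added)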


Solution

  • The problem comes down to the method you are using to calculate the string distance. You are using the lcs (longest common substring) method, which in effect only allows deletions and insertions rather than substitutions. From the docs:

    The longest common substring (method='lcs') is defined as the longest string that can be obtained by pairing characters from a and b while keeping the order of characters intact. The lcs-distance is defined as the number of unpaired characters. The distance is equivalent to the edit distance allowing only deletions and insertions, each with weight one.

    So each space-to-underscore change costs 2 under lcs (one deletion plus one insertion), because direct substitution is not allowed:

    stringdist('abc def', 'abc_def', method = 'lcs')
    #> [1] 2
    

    This is in contrast to the default 'osa' method, which, like the Levenshtein distance and the base R function adist, allows direct substitutions at a cost of 1:

    stringdist('abc def', 'abc_def', method = 'osa')
    #> [1] 1
    
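
    As a quick cross-check (base R only, no extra packages), adist computes a generalized Levenshtein distance and likewise counts the space-to-underscore change only once:

    adist('abc def', 'abc_def')
    #>      [,1]
    #> [1,]    1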

    You can compare how the different stringdist methods score your two strings. To simplify further, let's make both lowercase, since you are already specifying ignore_case in your left join:

    library(stringdist)
    
    a <- "other field crops (non-organic)"
    b <- "other_field_crops_non_organic"
    methods <- c("osa", "lv", "dl", "hamming", "lcs", 
                 "qgram", "cosine", "jaccard", "jw", "soundex")
    
    sapply(methods, function(x) stringdist(a, b, method = x))
    #>        osa         lv         dl    hamming        lcs      qgram     cosine 
    #>  6.0000000  6.0000000  6.0000000        Inf 10.0000000 10.0000000  0.2025635 
    #>    jaccard         jw    soundex 
    #>  0.2500000  0.1104931  0.0000000
    

    You can see that the Hamming distance is infinite, since your strings have different lengths; osa (the default method) gives only 6, but lcs needs 10: 4 removals of underscores, 3 additions of spaces, one addition of a hyphen, and two additions of parentheses. If this string pair is representative of your data, you might want to switch to "osa".

    Created on 2022-04-14 by the reprex package (v2.0.1)
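
    For example, here is a sketch of the join from the question using the default "osa" method instead; max_dist = 6 is chosen from the osa distance computed above:

    library(fuzzyjoin)

    one <- as.data.frame("Other field crops (non-organic)")
    names(one) <- "A"
    two <- as.data.frame("other_field_crops_non_organic")
    names(two) <- "A"

    # osa distance is 6 when case is ignored, so max_dist = 6 is enough for a match
    stringdist_left_join(one, two, by = "A", method = "osa", max_dist = 6, ignore_case = TRUE)
    #>                               A.x                           A.y
    #> 1 Other field crops (non-organic) other_field_crops_non_organic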