Tags: r, string, substring, lcs, stringdist

How to calculate longest common substring anywhere in two strings


I am trying to calculate the longest exact common substring (contiguous, without gaps) between a string and each element of a vector of strings in R. How do I modify the stringdist call so that it finds a common substring anywhere in the two compared strings and returns its length?

Reproducible data:

string1 <- "whereiam"
vec1 <- c("firstiam","twoiswhereiaminthisvec","thisisthree","fouriamhere","fivewherehere")

stringdist call I tried (doesn't work for my purposes):

library(stringdist)
stringdistvec <- stringdist(string1,vec1,method="lcs")
[1]  8 14 13 11 11  #not calculating the lcs type I want

Desired result instead with explanation of matches:

#desired result:

desired_stringdistvec <- c(3,8,1,3,5)
[1]  3 8 1 3 5
#match 1: iam (3 common substr)
#match 2: whereiam (8 common substr)
#match 3: i (one letter only)
#match 4: iam (3 common substr)
#match 5: where (5 common substr)

Solution

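  • One approach might be to look at the transformation sequence produced by adist() (from the utils package) and count the characters in the longest contiguous run of matches ("M"). The run comes from one optimal edit alignment, so it is a common substring but is not guaranteed to be the longest one: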

    trafos <- attr(adist(string1, vec1, counts = TRUE), "trafos")
    sapply(gregexpr("M+", trafos), function(x) max(0, attr(x, "match.length")))
    
    [1] 3 8 1 3 5
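
  • If a strict longest common substring is needed, independent of any edit alignment, a minimal base-R sketch using the usual dynamic-programming table might look like the following. The function name lcsubstr_len is hypothetical (it is not part of stringdist or base R), and the sketch assumes plain single-character splitting with strsplit() is adequate for the data:

```r
# Hypothetical helper: length of the longest contiguous substring shared by
# two strings, using a rolling row of the dynamic-programming table.
lcsubstr_len <- function(a, b) {
  if (!nzchar(a) || !nzchar(b)) return(0L)
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  # prev[j + 1] = length of the common run ending at x[i - 1] and y[j];
  # prev[1] is a sentinel that always stays 0.
  prev <- integer(length(y) + 1)
  best <- 0L
  for (i in seq_along(x)) {
    curr <- integer(length(y) + 1)
    hit <- which(y == x[i])          # positions in y matching x[i]
    curr[hit + 1] <- prev[hit] + 1L  # extend the run that ended one step back
    best <- max(best, curr)
    prev <- curr
  }
  best
}

string1 <- "whereiam"
vec1 <- c("firstiam", "twoiswhereiaminthisvec", "thisisthree",
          "fouriamhere", "fivewherehere")

# Each element of vec1 is passed as `b`; string1 is fixed as `a`.
res <- vapply(vec1, lcsubstr_len, integer(1), a = string1, USE.NAMES = FALSE)
res
# [1] 3 8 2 4 5
```

    This differs from the desired vector at positions 3 and 4: "re" occurs in both "whereiam" and "thisisthree", and "here" occurs in both "whereiam" and "fouriamhere", so a strict longest-common-substring count gives 2 and 4 there rather than 1 and 3.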