r levenshtein-distance agrep

Extract substring match from agrep


My goal is to identify whether a given text contains a target string, but I want to allow for typos / small deviations and extract the substring that "caused" the match (to use it for further text analysis).

Example:

target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

Desired output:

I would like to get target strlng as the output, since it is very close to the target (Levenshtein distance of 1). Next, I want to use target strlng to extract the word Butter (this part I have covered; I only mention it to give a detailed spec).

What I tried:

Using adist did not work, since by default it compares two whole strings, not substrings.
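As a side note, adist does take a partial argument that measures the distance from the target to its best-matching substring. The sketch below uses the question's example and only illustrates that option. It yields the distance, not the matched substring, so it does not solve the extraction problem on its own.

```r
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

# Default: edit distance between the two complete strings. This is large,
# because most of `text` would have to be deleted to obtain `target`.
d_full <- adist(target, text)

# partial = TRUE: distance from `target` to its closest substring of `text`.
d_partial <- adist(target, text, partial = TRUE)

d_partial[1, 1]
#[1] 1
```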

Next I took a look at agrep, which seems very close. It tells me that my target was found, but not which substring "caused" the match.

I tried value = TRUE, but it seems to work at the level of whole vector elements. I don't think I can switch to a vector of words, because I cannot split by spaces (my target string might contain spaces, ...).

agrep(
  pattern = target, 
  x = text,
  value = TRUE
)
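To make the limitation concrete, here is a small sketch with the same target and text. The variable name res is only for illustration. With value = TRUE, agrep returns each matching element of x in full, so a single long string comes back unchanged.

```r
target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

res <- agrep(pattern = target, x = text, value = TRUE)

# The match is reported per element of `x`, so the full string is returned.
identical(res, text)
#[1] TRUE

# agrepl() gives only the yes/no answer per element.
agrepl(target, text)
#[1] TRUE
```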

Solution

  • Use aregexec. It is used much like regexpr/regmatches (or gregexpr) for extracting exact matches.

    m <- aregexec('string', 'text strlng wrong')
    regmatches('text strlng wrong', m)
    #[[1]]
    #[1] "strlng"
    

    This can be wrapped in a function that accepts the arguments of both aregexec and regmatches. Note that the regmatches argument invert comes after the dots argument ... in the wrapper's signature, so it must be passed by name.

    # `...` is forwarded to aregexec(); `invert` is forwarded to regmatches().
    aregextract <- function(pattern, text, ..., invert = FALSE){
      m <- aregexec(pattern, text, ...)
      regmatches(text, m, invert = invert)
    }
    
    aregextract(target, text)
    #[[1]]
    #[1] "target strlng"
    
    aregextract(target, text, invert = TRUE)
    #[[1]]
    #[1] "the "                                       
    #[2] ": Butter. this text i dont want to extract."
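Because ... is forwarded to aregexec, the tolerance and pattern interpretation can be set per call. The sketch below assumes the same wrapper and example strings as above. The names exact and fuzzy are only for illustration, and max.distance and fixed are ordinary aregexec arguments.

```r
# Same wrapper as above, repeated so the snippet runs on its own.
aregextract <- function(pattern, text, ..., invert = FALSE){
  m <- aregexec(pattern, text, ...)
  regmatches(text, m, invert = invert)
}

target <- "target string"
text <- "the target strlng: Butter. this text i dont want to extract."

# max.distance = 0 allows no edits, so the misspelled text does not match
# and the list element is an empty character vector.
exact <- aregextract(target, text, max.distance = 0)
length(exact[[1]])
#[1] 0

# Allow at most one edit, and treat the target as a literal string
# rather than a regular expression.
fuzzy <- aregextract(target, text, max.distance = 1, fixed = TRUE)
fuzzy[[1]]
#[1] "target strlng"
```

An empty element is therefore the signal that no sufficiently close substring was found. Passing fixed = TRUE is worth considering whenever the target may contain regular expression metacharacters.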