Tags: r, regex, string-matching, grepl

String matching with NEAR regex and multiple terms


I have a vector containing some strings, like the following:

test_strings <- c("this string referring to dummy text should be matched", 
                  "this string referring to an example of code should be matched",
                  "this string referring to texts which are kind of dumb should be matched",
                  "this string referring to an example, but with a really long gap before mentioning a word such as 'text' should not be matched")

I have two lists of search terms:

list_a <- c("dummy", "dumb", "example", "examples")
list_b <- c("text", "texts", "script", "scripts", "code")

I would like to return matches where there is some combination of a string from list_a and a string from list_b, with these strings appearing within 10 words of each other (i.e. elements 1-3 of test_strings).

Based on the helpful answers to this question: R - Searching text with NEAR regex, I was able to implement the 'NEAR' function, but my code fails to return the correct matches once I include multiple terms, some of which are substrings.

Here is what I have tried so far:

regex_string <- "\\b(?:(dum|example)\\W+(?:\\w+\\W+){0,10}?(text|script|code)|(text|script|code)\\W+(?:\\w+\\W+){0,10}?(dum|example))\\b"

test_results <- test_strings[grepl(regex_string,test_strings, ignore.case=TRUE)]

test_results

This only returns strings containing an exact match, i.e. "this string referring to an example of code should be matched".

regex_string <- "\\b(?:(dum.*|example.*)\\W+(?:\\w+\\W+){0,10}?(text.*|script.*|code)|(text.*|script.*|code)\\W+(?:\\w+\\W+){0,10}?(dum.*|example.*))\\b"

test_results <- test_strings[grepl(regex_string,test_strings, ignore.case=TRUE)]

test_results

This allows me to match substrings, so that "this string referring to dummy text should be matched", "this string referring to an example of code should be matched" and "this string referring to texts which are kind of dumb should be matched" are returned.

However, "this string referring to an example, but with a really long gap before mentioning a word such as 'text' should not be matched" is also returned. I guess the inclusion of ".*" is somehow invalidating the 0-10 word restriction.

Any ideas on how I could fix this?


Solution

  • If you really need a regex, then this one should work:

    regex_string <- r"(\b(?:dum|example)\w*(?:\W+\w+){0,10}\W+(?:text\w*|script\w*|code)\b|\b(?:text\w*|script\w*|code)(?:\W+\w+){0,10}\W+(?:dum|example)\w*\b)"
    

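    .* didn't work because . matches any character, including spaces and punctuation, so dum.* can run across any number of words before the {0,10} counter even starts; being greedy, it takes as much as it can. The pattern above uses \w* instead, which can only extend to the end of the current word. More generally, variable-length matching of anything with no strict boundaries is rarely a good idea.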
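    As a quick check, the pattern can be run against the question's test data. This is a minimal sketch, assuming R 4.0.0 or later (needed for the r"(...)" raw-string syntax); perl = TRUE is an added choice here, not part of the original call, and dot_hit / word_hit are illustrative names that isolate the difference between .* and \w* on the string that should not match.

    ```r
    # Test data copied from the question.
    test_strings <- c(
      "this string referring to dummy text should be matched",
      "this string referring to an example of code should be matched",
      "this string referring to texts which are kind of dumb should be matched",
      "this string referring to an example, but with a really long gap before mentioning a word such as 'text' should not be matched"
    )

    # The pattern from the answer, unchanged.
    regex_string <- r"(\b(?:dum|example)\w*(?:\W+\w+){0,10}\W+(?:text\w*|script\w*|code)\b|\b(?:text\w*|script\w*|code)(?:\W+\w+){0,10}\W+(?:dum|example)\w*\b)"

    # perl = TRUE (an assumption of this sketch) selects the PCRE engine.
    matched <- grepl(regex_string, test_strings, ignore.case = TRUE, perl = TRUE)
    print(test_strings[matched])

    # Simplified one-direction patterns applied to the fourth string only:
    # with .* the word limit is bypassed, with \w* it is enforced.
    dot_hit  <- grepl(r"(\bexample.*\W+(?:\w+\W+){0,10}?text)", test_strings[4], perl = TRUE)
    word_hit <- grepl(r"(\bexample\w*\W+(?:\w+\W+){0,10}?text)", test_strings[4], perl = TRUE)
    ```

    Note that a stem such as dum\w* also accepts unrelated words that happen to start with "dum", so the stems may need tightening for real data.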

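    Explanation:

    \b(?:dum|example)\w* matches a whole word that starts with "dum" or "example" (dummy, dumb, examples).
    (?:\W+\w+){0,10} allows up to 10 intervening words, each preceded by at least one non-word character.
    \W+(?:text\w*|script\w*|code)\b then requires a separator followed by a word starting with "text" or "script", or the exact word "code".
    The second alternative, after the |, is the same pattern with the two term groups swapped, so the terms can appear in either order.

    Demo with further explanations.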

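    That said, a text-processing approach (for example, splitting each string into words and comparing word positions) is often a better fit than a regex for this kind of task.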
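
    One possible shape for a non-regex version is sketched below. It is not the answer author's code: near_match and its max_between parameter are hypothetical names, and the sketch assumes that "within 10 words" means at most 10 words between the two terms, mirroring the {0,10} in the regex. It splits each string into lower-cased words and compares the positions of exact matches from list_a and list_b, so no stem or substring handling is needed.

    ```r
    # Search terms copied from the question.
    list_a <- c("dummy", "dumb", "example", "examples")
    list_b <- c("text", "texts", "script", "scripts", "code")

    # Hypothetical helper: TRUE where some term from terms_a and some term
    # from terms_b occur with at most max_between words between them.
    near_match <- function(x, terms_a, terms_b, max_between = 10) {
      # Lower-case, then split on runs of non-word characters.
      tokens <- strsplit(tolower(x), "\\W+", perl = TRUE)
      vapply(tokens, function(w) {
        pos_a <- which(w %in% terms_a)
        pos_b <- which(w %in% terms_b)
        if (length(pos_a) == 0 || length(pos_b) == 0) return(FALSE)
        # Words strictly between two positions = |difference| - 1.
        any(abs(outer(pos_a, pos_b, "-")) - 1 <= max_between)
      }, logical(1))
    }

    test_strings <- c(
      "this string referring to dummy text should be matched",
      "this string referring to an example of code should be matched",
      "this string referring to texts which are kind of dumb should be matched",
      "this string referring to an example, but with a really long gap before mentioning a word such as 'text' should not be matched"
    )

    near <- near_match(test_strings, list_a, list_b)
    print(test_strings[near])
    ```

    Because the word lists are matched exactly, adding a new term only means extending list_a or list_b; the trade-off is that inflected forms have to be listed explicitly.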