python regex string match

Finding indices of exact-match words in Python


I'm trying to find the indices of a pattern in a sentence. The pattern can be a word or a combination of words. I'm using regular expressions for this, but I have some edge cases to handle.

import re

word = "is"
s = "Is (valid) is (valid), is-not (not valid), is. (valid) is!, (valid), is_1 (not valid) ,is (valid), is? (valid)"

iters = re.finditer(r"\b" + re.escape(word) + r"\b", s, re.I)
indices = [m.start(0) for m in iters]
print(indices)

This outputs

[0, 11, 23, 43, 55, 87, 99]

As the labels in the string show, some occurrences of "is" next to a symbol should match and others should not. These are the symbols that may sit next to a valid match:

[" ", ",", ".", "!", "?"]

How can I exclude the third match (is-not) from the results?


Solution

  • Your question is a little ambiguous. You list specific characters as boundary characters (rather than treating any non-word character as a boundary), yet your code uses the "\b" word-boundary assertion, which treats the transition between a word character and any non-word character as a boundary. So I cannot tell whether you simply want "\b" to stop treating "-" as a boundary character, or whether you want the regular expression rewritten to use exactly the boundary characters listed in your question.

    To keep "\b" from treating "-" as a boundary character, add a negative lookbehind assertion and a negative lookahead assertion (in effect saying "unless the boundary is caused by a dash"). Only one line of your code changes:

        iters = re.finditer(r"(?<!-)\b" + re.escape(word) + r"\b(?!-)", s, re.I)
    

    This change causes the output to become

        [0, 11, 43, 55, 87, 99]
    

    which seems to be what you wanted. Keep in mind that in an arbitrary string, as opposed to the one you supplied, other non-word characters besides the ones you listed (";", "(" or "/", for example) still count as boundaries and will produce a match.

    I am not supplying a regular expression that handles only the characters you specified, because your example code uses "\b". That suggests you want to keep "\b" and merely stop it from treating "-" as a boundary character, and that your list of boundary characters was drawn mostly from your example rather than meant to be exhaustive.
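
If the stricter reading is the one wanted, the two interpretations can be compared side by side. The following is only a sketch, not part of the original answer. It assumes the five listed characters (space, ",", ".", "!", "?") plus the start and end of the string are the only acceptable neighbours. The names `dash_aware`, `strict`, `find_indices` and the `extra` test string are made up for illustration.

```python
import re

word = "is"
s = ("Is (valid) is (valid), is-not (not valid), is. (valid) is!, (valid), "
     "is_1 (not valid) ,is (valid), is? (valid)")

# Reading 1 (the answer's approach): keep \b, but reject a dash on either side.
dash_aware = re.compile(r"(?<!-)\b" + re.escape(word) + r"\b(?!-)", re.I)

# Reading 2 (assumption): a neighbour must be one of " ,.!?" or the string edge.
# "(?<![^ ,.!?])" means "not preceded by a character outside the allowed set",
# which also succeeds at the very start of the string; the lookahead mirrors it.
strict = re.compile(r"(?<![^ ,.!?])" + re.escape(word) + r"(?![^ ,.!?])", re.I)


def find_indices(pattern, text):
    """Return the start index of every match of a compiled pattern in text."""
    return [m.start() for m in pattern.finditer(text)]


# On the string from the question both readings agree.
print(find_indices(dash_aware, s))  # [0, 11, 43, 55, 87, 99]
print(find_indices(strict, s))      # [0, 11, 43, 55, 87, 99]

# Hypothetical string with punctuation outside the listed set.
extra = "is; (is) is/are is-not is, is"
print(find_indices(dash_aware, extra))  # [0, 5, 9, 23, 27]
print(find_indices(strict, extra))      # [23, 27]
```

The two patterns only diverge once characters such as ";", "(" or "/" appear next to the word, so which one fits depends on whether the list in the question was meant to be exhaustive.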