regex

RegEx for a word boundary that still matches if the word is preceded or followed by special chars


What I want to accomplish is to match a word even if it is preceded or followed by non-alphanumeric characters.

So, for example, given the string "This string contains word1 and word2* and anotherword1", I would like to get two matches, word1 and word2, but neither anotherword1 nor the word1 inside anotherword1.

What I have right now is

\b(word1|word2)\b

but this does not match word2 (ignoring the *).

From what I have read, \b only matches between an alphanumeric character and a non-alphanumeric character, but I have no idea how to handle the special chars trailing my target words.

Later edit: I think (?i)(?<=^|[^a-zA-Z0-9])(word1|word2)(?=$|[^a-zA-Z0-9]) does the trick, but does it look OK? Is there a simpler way of doing this?


Solution

  • You are looking for an adaptive word boundary (yes, it is my concept that I described here):

    (?!\B\w)(word1|word2)(?!\B\w)
    

    Or, if you just want to make sure there is no word char on either end:

    (?<!\w)(word1|word2)(?!\w)
    

    The (?<!\w) and (?!\w) lookarounds are unambiguous leading ((?<!\w)) and trailing ((?!\w)) word boundaries.
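To make the two patterns above concrete, here is a minimal Python sketch run against the string from the question. Python's `re` module is only an assumption here (the question names no language), and the `word2\*` alternative and the `a word2*x` input in the second half are hypothetical additions meant to show where the adaptive and the unambiguous boundaries start to differ.

```python
import re

text = "This string contains word1 and word2* and anotherword1"

# Unambiguous boundaries: no word character directly before or after the match.
strict = re.compile(r"(?<!\w)(word1|word2)(?!\w)")

# Adaptive boundaries: a boundary is only enforced on a side where the
# matched text itself starts or ends with a word character.
adaptive = re.compile(r"(?!\B\w)(word1|word2)(?!\B\w)")

strict_hits = [m.group(1) for m in strict.finditer(text)]
adaptive_hits = [m.group(1) for m in adaptive.finditer(text)]
print(strict_hits)    # ['word1', 'word2']
print(adaptive_hits)  # ['word1', 'word2']

# Hypothetical variation: one alternative now ends with a non-word character.
text2 = "a word2*x"
strict2 = re.compile(r"(?<!\w)(word1|word2\*)(?!\w)")
adaptive2 = re.compile(r"(?!\B\w)(word1|word2\*)(?!\B\w)")

strict2_hits = [m.group(1) for m in strict2.finditer(text2)]
adaptive2_hits = [m.group(1) for m in adaptive2.finditer(text2)]
print(strict2_hits)    # [] because "x" follows the match
print(adaptive2_hits)  # ['word2*'] because nothing is required after "*"
```

For the question's input both variants behave the same; they only diverge once a search term can begin or end with a special character.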

    The meaning of the \b construct depends on the context. In \bw, the \b requires a non-word character (or the start of the string) before w, so it matches the w in *w. In \b\*, the \b requires a word character before *, because * is itself a non-word character.
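The context dependence of \b can be checked with a few one-line searches. This is a small sketch, again assuming Python's `re` module; the sample inputs other than `*w` are made up for illustration.

```python
import re

# \b in front of a word character: the left side must be a non-word
# character or the start of the string.
w_after_star = bool(re.search(r"\bw", "*w"))    # boundary between "*" and "w"
w_after_letter = bool(re.search(r"\bw", "aw"))  # no boundary between "a" and "w"

# \b in front of a non-word character: the left side must be a word character.
star_after_letter = bool(re.search(r"\b\*", "a*"))  # boundary between "a" and "*"
star_after_space = bool(re.search(r"\b\*", " *"))   # no boundary between " " and "*"

print(w_after_star, w_after_letter)         # True False
print(star_after_letter, star_after_space)  # True False
```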

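    In languages that do not support lookbehinds, replace (?<!\w) with (^|\W). Because that group consumes the character before the word, the surrounding code must then read the word from its own capture group and put the consumed character back when replacing.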
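A rough sketch of that lookbehind-free fallback follows. Python does support lookbehinds, so it is used here only to illustrate the extra handling; the bracket markers in the replacement are an arbitrary choice for the example.

```python
import re

text = "This string contains word1 and word2* and anotherword1"

# (^|\W) consumes the character before the word, so the whole match is
# one character too long on the left whenever the word is not at the start.
pattern = re.compile(r"(^|\W)(word1|word2)(?!\w)")

# Extraction: read the word from group 2 instead of the whole match.
words = [m.group(2) for m in pattern.finditer(text)]
print(words)  # ['word1', 'word2']

# Positions: use the span of group 2, not the span of the whole match.
spans = [m.span(2) for m in pattern.finditer(text)]

# Replacement: put the consumed character (group 1) back in front.
marked = pattern.sub(r"\1[\2]", text)
print(marked)  # This string contains [word1] and [word2]* and anotherword1
```

The trailing side can stay a lookahead, since lookaheads are available in practically every regex flavor that lacks lookbehinds.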