[SOLVED] RegEx for word boundary but still match if is preceded or followed by special chars

I want to match whole words even if they are preceded or followed by non-alphanumeric characters.

So, for example, for the string "This string contains word1 and word2* and anotherword1", I would like to get two matches, word1 and word2, but not anotherword1, nor the word1 inside anotherword1.

What I have right now is

\b(word1|word2)\b

but this does not match word2 (ignoring the *).

From what I read, \b only matches between an alphanumeric character and a non-alphanumeric character, but I have no idea how to handle these special characters trailing my target words.

Later edit: I think (?i)(?<=^|[^a-zA-Z0-9])(word1|word2)(?=$|[^a-zA-Z0-9]) does the trick, but does it look OK? Is there a simpler way of doing this?

Solution

You are looking for an adaptive word boundary (yes, it is my concept that I described here):

(?!\B\w)(word1|word2)(?!\B\w)
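A minimal Python sketch of how this differs from plain \b, assuming the search terms may themselves start or end with a special character (the terms list and the variable names are made up for illustration):

```python
import re

text = "This string contains word1 and word2* and anotherword1"

# Hypothetical search terms; the second one ends with a non-word character.
terms = ["word1", "word2*"]
alternation = "|".join(re.escape(t) for t in terms)

# Plain \b on both sides versus the adaptive boundaries from the answer.
plain = re.compile(rf"\b(?:{alternation})\b")
adaptive = re.compile(rf"(?!\B\w)(?:{alternation})(?!\B\w)")

plain_hits = plain.findall(text)
adaptive_hits = adaptive.findall(text)

print(plain_hits)     # ['word1']
print(adaptive_hits)  # ['word1', 'word2*']
```

The plain version loses word2* because there is no word boundary between * and the following space, while the adaptive version only enforces a boundary on a side where the term has a word character.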

Or, if you just want to make sure there is no word character on either end:

(?<!\w)(word1|word2)(?!\w)

The (?<!\w) and (?!\w) lookarounds are unambiguous leading and trailing word boundaries, respectively.
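As a quick check against the sample string from the question, here is a small Python sketch (Python is just an assumption here, the question does not name a language):

```python
import re

text = "This string contains word1 and word2* and anotherword1"

# No word character may sit directly before or directly after the match.
pattern = re.compile(r"(?<!\w)(word1|word2)(?!\w)")

matches = [m.group(1) for m in pattern.finditer(text)]
print(matches)  # ['word1', 'word2']
```

The word1 inside anotherword1 is skipped because the lookbehind sees the r right before it.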

The meaning of the \b construct depends on the context: \bw will match the w in *w because \b there requires a non-word character (or the start of the string) before it, whereas \b\* requires a word character before the *, because * is a non-word character.
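The context dependence can be seen with a few one-line searches. This is only an illustrative Python sketch; the found helper is a made-up name:

```python
import re

def found(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return re.search(pattern, text) is not None

# \b before a word character needs a non-word character
# (or the start of the string) on its left.
w_after_star = found(r"\bw", "*w")      # True
w_after_letter = found(r"\bw", "aw")    # False

# \b before a non-word character needs a word character on its left.
star_after_letter = found(r"\b\*", "a*")  # True
star_after_space = found(r"\b\*", " *")   # False

print(w_after_star, w_after_letter, star_after_letter, star_after_space)
```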

In languages that do not support lookbehinds, (?<!\w) should be replaced with (^|\W). Since that group consumes the preceding character, the word itself then has to be pulled out of its own capturing group in the code.
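A rough sketch of that workaround, written in Python only for illustration (Python does support lookbehinds, so this merely imitates an engine that lacks them, and it assumes lookaheads are still available):

```python
import re

text = "This string contains word1 and word2* and anotherword1"

# (^|\W) consumes the character before the word, so the full match
# may include it; the word itself is in group 2.
pattern = re.compile(r"(^|\W)(word1|word2)(?!\w)")

words = [m.group(2) for m in pattern.finditer(text)]
spans = [m.span(2) for m in pattern.finditer(text)]

print(words)  # ['word1', 'word2']
```

Using the span of group 2 rather than the span of the whole match keeps replacements or highlighting from touching the consumed character. One side effect to keep in mind: since that character is consumed, two target words separated by a single non-word character cannot both match in one pass.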