What I want to accomplish is to match whole words even if they are followed or preceded by non-alphanumeric characters.
For example, given the string `This string contains word1 and word2* and anotherword1`, I would like to get two matches, for `word1` and `word2`, but not for `anotherword1`, nor for the `word1` inside `anotherword1`.
What I have right now is
\b(word1|word2)\b
but this will not match `word2` (ignoring the `*`).
From what I have read, `\b` only matches between an alphanumeric character and a non-alphanumeric character, but I have no idea how to handle the special characters trailing my target words.
Later edit: I think `(?i)(?<=^|[^a-zA-Z0-9])(word1|word2)(?=$|[^a-zA-Z0-9])` does the trick ... but does it look OK? Is there a simpler way of doing this?
You are looking for an adaptive word boundary (yes, it is my concept that I described here):
(?!\B\w)(word1|word2)(?!\B\w)
Or, if you just want to make sure there is no word char on both ends:
(?<!\w)(word1|word2)(?!\w)
The `(?<!\w)` lookbehind is an unambiguous leading word boundary, and the `(?!\w)` lookahead is an unambiguous trailing word boundary.
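As a quick check, here is a minimal Python sketch that runs both patterns above against the sample string from the question. Python is only an assumption here, since the question does not name a regex flavor, and the variable names are illustrative.

```python
import re

text = "This string contains word1 and word2* and anotherword1"

# Adaptive word boundaries: a boundary is only required where the
# keyword itself starts or ends with a word character.
adaptive = re.compile(r"(?!\B\w)(word1|word2)(?!\B\w)")

# Unambiguous boundaries: no word character directly before or after.
strict = re.compile(r"(?<!\w)(word1|word2)(?!\w)")

adaptive_matches = adaptive.findall(text)
strict_matches = strict.findall(text)

print(adaptive_matches)  # ['word1', 'word2']
print(strict_matches)    # ['word1', 'word2']
```

Neither pattern matches the `word1` inside `anotherword1`, because a word character (`r`) sits directly before it.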
The meaning of the `\b` construct depends on the context: `\bw` will match a `w` in `*w`, as `\b` before a word character requires a non-word character (or the start of the string) before it, but `\b\*` will require a word character before `*`, as `*` is a non-word character.
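The context dependence of `\b` can be seen with a few one-line searches. This is a small sketch in Python (an assumed flavor; the test strings are made up for illustration):

```python
import re

# \b followed by a word character: a non-word character
# (or the start of the string) must come before it.
w_after_star = bool(re.search(r"\bw", "*w"))
w_after_letter = bool(re.search(r"\bw", "aw"))

# \b followed by a non-word character: a word character
# must come before it.
star_after_letter = bool(re.search(r"\b\*", "a*"))
star_after_space = bool(re.search(r"\b\*", " *"))

print(w_after_star, w_after_letter)         # True False
print(star_after_letter, star_after_space)  # True False
```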
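In languages that do not support lookbehinds, `(?<!\w)` should be replaced with `(^|\W)`. Since that group consumes the character before the word, further manipulation (such as reading the word from its own capturing group) should be done in the code.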
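A rough sketch of that fallback follows. Python does support lookbehinds, so it is used here only to illustrate the technique. The non-capturing form `(?:^|\W)` and the helper name `find_words` are assumptions made for this example.

```python
import re

text = "This string contains word1 and word2* and anotherword1"

# (?:^|\W) consumes the character before the keyword, so the keyword
# is read from capturing group 1 rather than from the whole match.
pattern = re.compile(r"(?:^|\W)(word1|word2)(?!\w)")

def find_words(s):
    """Return (word, start_offset) pairs for each keyword found."""
    return [(m.group(1), m.start(1)) for m in pattern.finditer(s)]

found = find_words(text)
words = [word for word, _ in found]
print(words)  # ['word1', 'word2']
```

The "further manipulation" here is using `m.group(1)` and `m.start(1)` instead of `m.group(0)` and `m.start(0)`, so the consumed leading character is left out of the result.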