It seems a little strange to me that \w matches [a-zA-Z0-9_]. I wonder why 0-9 and _ are counted among word characters and why - is not.
If I want to split the sentence:
This is counter-example.
with (\w*\b), it will split the word counter-example into two parts. Similarly, (count.*?\b) matches only counter.
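The behaviour described above can be reproduced with a small sketch. It assumes Java's java.util.regex; the class name WordBoundaryDemo and the helper nonEmptyMatches are made up for illustration. Because \w* can match the empty string at a boundary, the helper discards empty matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Hypothetical helper: collects every non-empty match of regex in input.
    static List<String> nonEmptyMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "This is counter-example.";
        // \b sees a boundary between 'r' and '-', so the word is cut there.
        System.out.println(nonEmptyMatches("(\\w*\\b)", s));      // [This, is, counter, example]
        // The lazy .*? stops at the first boundary after "count".
        System.out.println(nonEmptyMatches("(count.*?\\b)", s));  // [counter]
    }
}
```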
Would it be possible to have something like \b that treats - as a word character, as if it were included in \w? Or did I misunderstand the usage of \b? Are there some examples of its standard usage?
The fact that \w matches the underscore and digits along with uppercase and lowercase letters is historical: it was first introduced to match C identifiers.
Well, this is true for Java's \w (yes, \w will not match accented characters in Java by default).
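As a minimal sketch of that Java behaviour: the class name AsciiWordDemo and the helper allWordChars are hypothetical, while Pattern.UNICODE_CHARACTER_CLASS is a standard flag (Java 7 and later) that is not mentioned in the answer itself and is shown here only for contrast.

```java
import java.util.regex.Pattern;

class AsciiWordDemo {
    // Hypothetical helper: true if the whole input consists of \w characters
    // under the given Pattern flags.
    static boolean allWordChars(String input, int flags) {
        return Pattern.compile("\\w+", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String accented = "caf\u00e9"; // "café", written with an escape to avoid encoding issues

        System.out.println(allWordChars("foo_bar42", 0)); // true: letters, digits, underscore
        System.out.println(allWordChars(accented, 0));    // false: default \w is ASCII-only
        System.out.println(allWordChars(accented, Pattern.UNICODE_CHARACTER_CLASS)); // true
    }
}
```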
\b, however, is an anchor, and it is not necessarily defined as the frontier between a word character and a non-word character; in fact, its exact definition is implementation-dependent.
There is not really an anchor which does what you want, but if you want to match words and dashes, your best bet is \w*(-\w*)*. Again, this is the normal* (special normal*)* pattern, with \w as normal and - as special!
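A rough sketch of how that pattern could be applied to the sentence from the question, again assuming Java's java.util.regex. The class name HyphenWordDemo and the helper words are made up for illustration. Since every part of \w*(-\w*)* is optional, it also matches the empty string, so the sketch skips empty matches.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HyphenWordDemo {
    // normal* (special normal*)* with normal = \w and special = -
    private static final Pattern WORD = Pattern.compile("\\w*(-\\w*)*");

    // Hypothetical helper: returns the non-empty matches found in input.
    static List<String> words(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            if (!m.group().isEmpty()) {
                out.add(m.group());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("This is counter-example.")); // [This, is, counter-example]
    }
}
```

Note that, as written, the pattern also accepts a lone dash. A stricter variant such as \w+(-\w+)* would avoid both empty matches and lone dashes, though that tightening is an assumption on top of the pattern given above.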
(And BTW, \b is a "word anchor" in some dialects only; other implementations define \< and \> instead, for the beginning-of-word and end-of-word anchors respectively.)
[edit for a gross error]