java regex boundary word-boundaries

What are the most efficient ways to write regex with boundary matching in Java?


I found that the word boundary (\b) works well for making sure that exactly the searched word is found in the text, and that it is not matched inside other words that merely contain it. However, I noticed it works badly at the start and end of the String.

Ideally, I would expect a regex like this to also work at the string start and end, because that is where a word can start or end too:

String regex1 = "\\b" + searchedWord + "\\b";

However, it turned out I had to change the regex to the following to make sure it also works at the string start and end:

String regex2 = "(^|\\b)" + searchedWord + "($|\\b)";

I haven't discovered any side effects of using the latter regex yet. Still, I would like to know whether there is a special boundary construct, or a more efficient way to write the boundary, that makes the regex less ugly and less counter-intuitive.

Does anybody know a better way? You are also welcome to improve my suggested regex if you are aware of any problems with it.


Solution

  • If the first and last characters of your searchedWord are word chars, there can be no side effects.

    "Side" effects may only appear if the characters on either end are non-word characters.

    Now, \b may match in 4 positions: between the string start and a word char, between a non-word char and a word char, between a word char and a non-word char, and between a word char and the end of the string. If you need to make sure there is no word char before searchedWord, you may use an unambiguous (?<!\w) negative lookbehind, and to make sure there is no word char after it, you may use a (?!\w) negative lookahead.

    Also remember that \b, like \w, is not Unicode-aware by default. Add the Pattern.UNICODE_CHARACTER_CLASS flag or the embedded (?U) flag:

    String regex1 = "(?U)(?<!\\w)" + searchedWord + "(?!\\w)";
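The difference between \b and the lookarounds only shows up when searchedWord starts or ends with a non-word character. Below is a minimal sketch of that case. The class and method names are hypothetical, and wrapping the word in Pattern.quote is an assumption not made in the snippets above; it is added so that characters such as + are treated literally.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {

    // Hypothetical helper: uses \b on both sides, as in the question's first regex.
    static boolean containsWithWordBoundary(String text, String searchedWord) {
        // Pattern.quote (an assumption here) escapes regex metacharacters in the word.
        String regex = "\\b" + Pattern.quote(searchedWord) + "\\b";
        return Pattern.compile(regex).matcher(text).find();
    }

    // Hypothetical helper: uses the lookarounds, i.e. "no word char before, no word char after".
    static boolean containsWithLookarounds(String text, String searchedWord) {
        String regex = "(?U)(?<!\\w)" + Pattern.quote(searchedWord) + "(?!\\w)";
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // Word chars on both ends: \b already works at the string start and end.
        System.out.println(containsWithWordBoundary("cat", "cat"));          // true
        System.out.println(containsWithWordBoundary("concatenate", "cat"));  // false

        // The word ends with a non-word char: \b fails at the end of the string,
        // because there is no word char on either side of that position.
        System.out.println(containsWithWordBoundary("I use C++", "C++"));    // false
        System.out.println(containsWithLookarounds("I use C++", "C++"));     // true

        // The lookahead still rejects a word char directly after the word.
        System.out.println(containsWithLookarounds("C++x", "C++"));          // false
    }
}
```

If searchedWord is meant to be a regex fragment rather than literal text, Pattern.quote would have to be dropped again.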
    

    Another common approach is to require whitespace (or the start/end of the string) on both sides of the word:

    String regex1 = "(?U)(?<!\\S)" + searchedWord + "(?!\\S)";
    

    Note that this will not match when the word is immediately preceded or followed by punctuation.
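The whitespace-based variant can be checked the same way. This is a small sketch under the same assumptions as before: the class and method names are made up for illustration, and Pattern.quote is an addition that treats searchedWord as literal text.

```java
import java.util.regex.Pattern;

public class WhitespaceBoundaryDemo {

    // Hypothetical helper: the word must be surrounded by whitespace
    // or sit at the start/end of the string.
    static boolean containsDelimitedByWhitespace(String text, String searchedWord) {
        String regex = "(?U)(?<!\\S)" + Pattern.quote(searchedWord) + "(?!\\S)";
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(containsDelimitedByWhitespace("a cat sat", "cat"));   // true
        System.out.println(containsDelimitedByWhitespace("cat", "cat"));         // true

        // Punctuation next to the word counts as a non-whitespace char, so no match.
        System.out.println(containsDelimitedByWhitespace("a cat, sat", "cat"));  // false
        System.out.println(containsDelimitedByWhitespace("(cat)", "cat"));       // false
    }
}
```

Which variant fits depends on whether text such as "cat," should count as containing the word; the \w-based lookarounds accept it, the \S-based ones do not.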