regex pcre

Order of word boundaries and anchors in PCRE


Are the following pairs of expressions equivalent in PCRE?

  1. For ^ : \b^<some-regex>\b and ^\b<some-regex>\b (e.g.: \b^[a-z]\b and ^\b[a-z]\b)

  2. For $ : \b<some-regex>$\b and \b<some-regex>\b$ (e.g.: \b[a-z]$\b and \b[a-z]\b$)

  3. For the combination of ^ and $ :

I tested all of the options above and couldn't find any difference in matching. If they're not equivalent, please give an example input that matches one but not the other.


Solution

  • Anchors and word boundaries are non-consuming (zero-width) patterns. That means the regex index stays at the same position in the string after an anchor or a word boundary has been evaluated.

    In a \b$ pattern, the regex engine ensures the current position is a word boundary position, and, staying at the same position, also checks if it is the end of the string.

    In a $\b pattern, the regex engine first ensures the current position is the end of the string, and then, staying at the same position, also checks if it is a word boundary position.

    So \b$ is equivalent to $\b: both assertions must hold at the same position, and the order in which they are checked does not change the result.

    The same applies to ^\b and \b^ (where ^ matches the start-of-string position).

    You might have heard that lookarounds are non-consuming, and that is true. \b can in fact be written as the lookaround alternation (?<!\w)(?=\w)|(?<=\w)(?!\w). ^ and $ are trickier, but the same idea holds: ^ is equivalent to (?=^) and $ is equivalent to (?=$).
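
The reasoning above can be spot-checked empirically. The following is a minimal sketch in Python, not a proof: Python's re module is not PCRE, but it also treats \b, ^ and $ as zero-width assertions, so the same argument should carry over. The sample strings and the helper names (spans, all_agree) are made up for illustration.

```python
import re

# Pattern pairs from the question: same assertions, different order.
PAIRS = [
    (r"\b^[a-z]\b", r"^\b[a-z]\b"),
    (r"\b[a-z]$\b", r"\b[a-z]\b$"),
]

# Zero-width constructs and their lookaround paraphrases from the answer.
PARAPHRASES = [
    (r"\b", r"(?<!\w)(?=\w)|(?<=\w)(?!\w)"),
    (r"^", r"(?=^)"),
    (r"$", r"(?=$)"),
]

# Illustrative inputs (an assumption, not an exhaustive set).
SAMPLES = ["", "a", "ab", "a b", " a", "a ", "a\n", "1", "b-a", "-"]


def spans(pattern, text):
    """Return the (start, end) span of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]


def all_agree(pairs, samples=SAMPLES):
    """True if both patterns of every pair match the same spans on every sample."""
    return all(spans(p, s) == spans(q, s) for p, q in pairs for s in samples)


if __name__ == "__main__":
    print(all_agree(PAIRS))        # order of adjacent assertions
    print(all_agree(PARAPHRASES))  # lookaround rewrites
```

If the patterns in each pair behave identically on these inputs, both calls print True. Comparing match spans rather than just "matched or not" also catches the case where two patterns match the same string at different positions.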