
Why does this regular expression fail with PCRE (PHP < 7.3) but work with PCRE2 (PHP >= 7.3)?


The regular expression:

/(?<nn>(?!und)[^\/,&;]+)(?:,\s?+)(?<vn>(?1))(?:\/|&|;|und|$)\s?/

is supposed to produce two matches with preg_match_all:

nn(1): Oidtmann-van Beek
vn(1): Jeanne 

nn(2): Oidtmann
vn(2): Peter

The sample string is: Oidtmann-van Beek, Jeanne und Oidtmann, Peter

This works with PCRE2 (PHP >= 7.3), but not with PCRE (PHP < 7.3). Why?

https://regex101.com/r/zotHZN/1/


Solution

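  • You do not get the expected output with PCRE (PHP < 7.3) because there the (?1) subroutine call is atomic: once it has matched, the engine cannot backtrack into it. On the sample string, (?1) greedily consumes "Jeanne und Oidtmann" and cannot give characters back so that the following und alternative can match.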

    See "Differences in recursion processing between PCRE2 and Perl":

    Before release 10.30, recursion processing in PCRE2 differed from Perl in that a recursive subroutine call was always treated as an atomic group. That is, once it had matched some of the subject string, it was never re-entered, even if it contained untried alternatives and there was a subsequent matching failure. (Historical note: PCRE implemented recursion before Perl did.)

    Starting with release 10.30, recursive subroutine calls are no longer treated as atomic. That is, they can be re-entered to try unused alternatives if there is a matching failure later in the pattern. This is now compatible with the way Perl works. If you want a subroutine call to be atomic, you must explicitly enclose it in an atomic group.
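
    The atomic behaviour described in the quote can be reproduced on newer PHP versions as well by wrapping the call in an explicit atomic group, (?>(?1)). The snippet below is only a sketch to illustrate the failure; the variable names are illustrative and not part of the question.

    ```php
    <?php
    $subject = 'Oidtmann-van Beek, Jeanne und Oidtmann, Peter';

    // Original pattern, with the subroutine call wrapped in an explicit atomic
    // group so that it behaves the same before and after PCRE2 10.30.
    $atomic = '/(?<nn>(?!und)[^\/,&;]+)(?:,\s?+)(?<vn>(?>(?1)))(?:\/|&|;|und|$)\s?/';

    $count = preg_match_all($atomic, $subject, $m);

    // (?1) takes "Jeanne und Oidtmann" and cannot give characters back, so no
    // match can start inside the first name; a single match is found instead.
    var_dump($count);   // int(1)
    var_dump($m['nn']); // one entry: " Jeanne und Oidtmann"
    var_dump($m['vn']); // one entry: "Peter"
    ```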

    So, the solution is to use the pattern itself, not the subroutine:

    /(?<nn>(?!und)[^\/,&;]+),\s?+(?<vn>(?!und)[^\/,&;]+)(?:\/|&|;|und|$)\s?/
    

    Note that I replaced (?:,\s?+) with just ,\s?+ because the non-capturing group is redundant here.
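
    As a quick check, the pattern with the subroutine call written out can be run against the sample string from the question. This is a minimal sketch; the variable names are illustrative, and it should behave the same on PHP < 7.3 and PHP >= 7.3 because it contains no subroutine call.

    ```php
    <?php
    $subject = 'Oidtmann-van Beek, Jeanne und Oidtmann, Peter';

    // The subroutine call (?1) is replaced by a copy of the group 1 pattern,
    // so the engine may backtrack into it regardless of the PCRE version.
    $pattern = '/(?<nn>(?!und)[^\/,&;]+),\s?+(?<vn>(?!und)[^\/,&;]+)(?:\/|&|;|und|$)\s?/';

    $count = preg_match_all($pattern, $subject, $m);

    var_dump($count);   // int(2)
    var_dump($m['nn']); // "Oidtmann-van Beek", "Oidtmann"
    var_dump($m['vn']); // "Jeanne " (with a trailing space), "Peter"
    ```

    The first vn value keeps a trailing space because [^\/,&;]+ also matches whitespace; apply trim() to the captures if that matters.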

    I think a more precise pattern, such as /\b(?<nn>(?!und\b)\w+(?:[-'\s]+(?!und\b)\w+)*),\s?(?<vn>(?&nn))\b/u, would be better here. See this regex demo. It does not need to backtrack into the (?&nn) subroutine call, because the \w+(?:[-'\s]+(?!und\b)\w+)* part cannot overlap with the ,\s? pattern, so it works even where the call is atomic.
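
    A short sketch of how this stricter pattern might be used, again on the sample string from the question (variable names are illustrative):

    ```php
    <?php
    $subject = 'Oidtmann-van Beek, Jeanne und Oidtmann, Peter';

    // (?&nn) reuses the nn pattern as a subroutine. The (?!und\b) lookahead
    // stops each name before the word "und", so the call never has to give
    // characters back and works even when the call is atomic.
    $pattern = '/\b(?<nn>(?!und\b)\w+(?:[-\'\s]+(?!und\b)\w+)*),\s?(?<vn>(?&nn))\b/u';

    $count = preg_match_all($pattern, $subject, $m);

    var_dump($count);   // int(2)
    var_dump($m['nn']); // "Oidtmann-van Beek", "Oidtmann"
    var_dump($m['vn']); // "Jeanne", "Peter"
    ```

    Unlike the character-class version, the vn captures here carry no trailing whitespace, since each name must end in a word character.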