phpregexrecursionpcre

Recursive regex pattern not matching all characters as expected (PHP differs from Perl implementation)


EDIT: I selected ridgerunner's answer as it contained the information needed to solve the problem. But I also felt like adding a fully fleshed-out solution to the specific question in case someone else wants to fully understand the example too. You will find it somewhere below.

This question is about clarifying the behavior of php's regex engine for recursive expressions. (If you ideas for how to properly match the strings below without using recursive php regex, that's very cool, but that's not the question.)

a(?:(?R)|a?)a

This is a simple expression that aims to match the character "a" or nothing, nested in one or multiple nests of the character "a". For instance, aa, aaa, aaaa, aaaaa. You don't need to use recursion for this:

aa*a

would work great. But the point is to use recursion.

Here is a piece of code you can run to test my failing pattern:

<?php
$tries=array('a','aa','aaa','aaaa','aaaaa','aaaaaa');
$regex='#a(?:(?R)|a?)a#';
foreach ($tries as $try) {
echo $try." : ";
if (preg_match($regex,$try,$hit)) echo $hit[0]."<br />";
else echo 'no match<br />';
}
?>

In the pattern, two "a"s are framing an alternation. In the alternation, we either match a recursion of the whole pattern (two "a"s framing an alternation), or the character "a", optionally empty.

In my mind, for "aaaa", this should match "aaaa".

But here is the output:

a : no match
aa : aa
aaa : aaa
aaaa : aaa
aaaaa : aaaaa
aaaaaa : aaa

Can someone explain what is happening on the third and fifth lines of output? I have tried tracing the path that I imagine the engine must be taking, but I must be imagining it wrong. Why is the engine returning "aaa" as a match for "aaaa"? What makes it so eager? I must be imagining the matching tree in the wrong order.

I realise that

#(?:a|a(?R)a)*#

kind of works, but my question is why the other pattern doesn't.


Solution

  • Excellent (and difficult) question!

    First, with the PCRE regex engine, the (?R) behaves like an atomic group (unlike Perl?). Once it matches (or doesn't match), the matching that happened inside the recursive call is final (and all backtracking breadcrumbs saved within the recursive call are discarded). However, the regex engine does save what was matched by the whole (?R) expression, and can give it back and try the other alternative to achieve an overall match. To describe what is happening, lets change your example slightly so that it will be easier to talk about and keep track of what is being matched at each step. Instead of: aaaa as the subject text, lets use: abcd. And lets change the regex from '#a(?:(?R)|a?)a#' to: '#.(?:(?R)|.?).#'. The regex engine matching behavior is the same.

    Matching regex: /.(?:(?R)|.?)./ to: "abcd"

    answer = r'''
    Step Depth Regex          Subject  Comment
    1    0     .(?:(?R)|.?).  abcd     Dot matches "a". Advance pointers.
               ^              ^
    2    0     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 1).
                     ^         ^
    3    1     .(?:(?R)|.?).  abcd     Dot matches "b". Advance pointers.
               ^               ^
    4    1     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 2).
                     ^          ^
    5    2     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
               ^                ^
    6    2     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 3).
                     ^           ^
    7    3     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
               ^                 ^
    8    3     .(?:(?R)|.?).  abcd     Try 1st alt. Recursive call (to depth 4).
                     ^            ^
    9    4     .(?:(?R)|.?).  abcd     Dot fails to match end of string.
               ^                  ^    DEPTH 4 (?R) FAILS. Return to step 8 depth 3.
                                       Give back text consumed by depth 4 (?R) = ""
    10   3     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches EOS.
                        ^         ^    Advance regex pointer.
    11   3     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                           ^      ^    DEPTH 3 (?R) FAILS. Return to step 6 depth 2
                                       Give back text consumed by depth3 (?R) = "d"
    12   2     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "d".
                        ^        ^     Advance pointers.
    13   2     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                           ^      ^    Backtrack to step 12 depth 2
    14   2     .(?:(?R)|.?).  abcd     Match zero "d" (give it back).
                        ^        ^     Advance regex pointer.
    15   2     .(?:(?R)|.?).  abcd     Dot matches "d". Advance pointers.
                           ^     ^     DEPTH 2 (?R) SUCCEEDS.
                                       Return to step 4 depth 1
    16   1     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                           ^      ^    Backtrack to try other alternative. Give back
                                        text consumed by depth 2 (?R) = "cd"
    17   1     .(?:(?R)|.?).  abcd     Optional dot matches "c". Advance pointers.
                        ^       ^      
    18   1     .(?:(?R)|.?).  abcd     Required dot matches "d". Advance pointers.
                           ^     ^     DEPTH 1 (?R) SUCCEEDS.
                                       Return to step 2 depth 0
    19   0     .(?:(?R)|.?).  abcd     Required dot fails to match end of string.
                           ^      ^    Backtrack to try other alternative. Give back
                                        text consumed by depth 1 (?R) = "bcd"
    20   0     .(?:(?R)|.?).  abcd     Try 2nd alt. Optional dot matches "b".
                        ^      ^       Advance pointers.
    21   0     .(?:(?R)|.?).  abcd     Dot matches "c". Advance pointers.
                           ^    ^      SUCCESSFUL MATCH of "abc"
    '''
    

    There is nothing wrong with the regex engine. The correct match is abc (or aaa for the original question.) A similar (albeit much longer) sequence of steps can be made for the other longer result string in question.