regexperl

Understanding alternation branches and the "branch reset" extension


When using alternation with captures, I've often been somewhat puzzled by how it actually works, in particular why I would often get void (undef) match entries which I then have to discard by filtering the list using grep, which doesn't feel very "DWIMmish".

However, by trial and error, I found it that it "does what I want" (accidentally?) using the "branch reset" feature.

Here's a script to illustrate the case:

use v5.12; # contrived RE example
# understanding alternation, branches, clustering, capturing
$_ = q(eins zwei drei vier fuenf sechs sieben);
my @m = m/(?:[a-z](ei)|(ie|eu|au))/g; # (?:…|…) plain clustering
my @n = m/(?|[a-z](ei)|(ie|eu|au))/g; # (?|…|…) "branch reset" extension
say for 'plain clustering w/o "branch reset"', @m, scalar @m;
say for  '', 'clustering with "branch reset"', @n, scalar @n;

Running this script (using 5.36.0 in case it matters), you'll see that both match the same five places, but the one w/o "branch reset" matches an annoying additional five places which are empty (actually, undef, by adding enabling warnings). Why does it do that? How can we conceptualize how the engine goes about its job?

By running with perl -Mre=debug, you'll get an analysis of what's going on under the hood. The final program (i.e. compiled RE) w/o "branch reset" extension reads as follows:

 1: BRANCH (9)
 2:   POSIXA[:lower:] (3)
 3:   OPEN1 (5)
 5:     EXACT <ei> (7)
 7:   CLOSE1 (24)
 9: BRANCH (FAIL)
10:   OPEN2 (12)
12:     TRIE-EXACT[aei] (21)
        <ie>
        <eu>
        <au>
21:   CLOSE2 (24)
23: TAIL (24)
24: END (0)

The final program with "branch reset" extension enabled results in the same listing, except there is now OPEN1 in line 10 and CLOSE1 in line 21.

If you redirect STDERR to a file and diff them, the only difference is the usage of OPEN1 and CLOSE1 for "branch reset" in several places.

So that doesn't really explain why I get these additional undef captures.

(I've read that it might be best to avoid alternation altogether, but that is not the point, and alternation might have advantages when the focus is on avoiding redundancy, for example.)


Solution

  • To illustrate what was happening with your code, I wrote this variation, where I pull out the matches while they happen, and print the context. Note the use of $1$2 in the print, printing the first two captures from the match.

    use strict;
    use warnings;
    use feature 'say';
    
    $_ = q(eins zwei drei vier fuenf sechs sieben);
    say "Regular";
    while (m/(?:[a-z](ei)|(ie|eu|au))/g) {
            say $` . ">$1$2<" . $';  # $` and $' and pre/post match
    }
    say "Branch reset";
    while (m/(?|[a-z](ei)|(ie|eu|au))/g) {
            say $` . ">$1$2<" . $';
    }
    

    Output:

    Regular
    Use of uninitialized value $2 in concatenation (.) or string at foo.pl line 10.
    eins z>ei< drei vier fuenf sechs sieben
    Use of uninitialized value $2 in concatenation (.) or string at foo.pl line 10.
    eins zwei d>ei< vier fuenf sechs sieben
    Use of uninitialized value $1 in concatenation (.) or string at foo.pl line 10.
    eins zwei drei v>ie<r fuenf sechs sieben
    Use of uninitialized value $1 in concatenation (.) or string at foo.pl line 10.
    eins zwei drei vier fuenf sechs s>ie<ben
    Branch reset
    Use of uninitialized value $2 in concatenation (.) or string at foo.pl line 14.
    eins z>ei< drei vier fuenf sechs sieben
    Use of uninitialized value $2 in concatenation (.) or string at foo.pl line 14.
    eins zwei d>ei< vier fuenf sechs sieben
    Use of uninitialized value $2 in concatenation (.) or string at foo.pl line 14.
    eins zwei drei v>ie<r fuenf sechs sieben
    Use of uninitialized value $2 in concatenation (.) or string at foo.pl line 14.
    eins zwei drei vier fuenf sechs s>ie<ben
    

    We can see that in the "Branch reset" section, $2 is always undefined, while in the "Regular" section, $2 is undefined up until $1 is undefined. When $1 is undef, $2 is printed instead. That is the second alternation matching.

    Note that in perldoc perlre we can read about the branch reset pattern (?|...:

    (?|pattern)

    This is the "branch reset" pattern, which has the special property that the capture groups are numbered from the same starting point in each alternation branch. It is available starting from perl 5.10.0.

    Capture groups are numbered from left to right, but inside this construct the numbering is restarted for each branch.

    The numbering within each branch will be as normal, and any groups following this construct will be numbered as though the construct contained only one branch, that being the one with the most capture groups in it.

    This construct is useful when you want to capture one of a number of alternative matches.

    Consider the following pattern. The numbers underneath show in which group the captured content will be stored.

        # before  ---------------branch-reset----------- after
        / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
        # 1            2         2  3        2     3     4
    

    So the short version is that when you do not use branch reset, you also capture the failed match in the first alternation when the second matches. In branch reset, there is no empty captures, as upon failure, it resets and goes to the next alternation.

    How to fix it? Well, you can put the capture around the alternations.

    while (m/((?<=[a-z])ei|(?:ie|eu|au))/g) { 
    #        ^---- capture parens -----^
            say $` . ">$1<" . $';  
    }
    

    Instead of [a-z] we use a zero-width lookbehind assertion (?<=[a-z]) that will not be included in the capture, and put the main parentheses around the entire section.

    The output for this code is:

    eins zw>ei< drei vier fuenf sechs sieben
    eins zwei dr>ei< vier fuenf sechs sieben
    eins zwei drei v>ie<r fuenf sechs sieben
    eins zwei drei vier fuenf sechs s>ie<ben