regex raku

Why/how is an additional variable needed when matching a repeated arbitrary character with capture groups?


I'm matching a sequence of a repeating arbitrary character, with a minimum length, using a perl6 regex.

After reading through https://docs.perl6.org/language/regexes#Capture_numbers and tweaking the example given, I've come up with this code using an 'external variable':

#uses an additional variable $c
perl6 -e '$_="bbaaaaawer"; /((.){} :my $c=$0; ($c)**2..*)/ && print $0';

#Output:  aaaaa

To aid in illustrating my question only, a similar regex in perl5:

#No additional variable needed
perl -e ' $_="bbaaaaawer"; /((.)\2{2,})/ && print $1';

Could someone enlighten me on the need/benefit of 'saving' $0 into $c and the requirement of the empty {}? Is there an alternative (better/golfed) perl6 regex that will match?

Thanks in advance.


Solution

  • Option #1: Don't sub-capture a pattern that includes a back reference

    $0 is a back reference (see footnote 1).

    If you omit the sub-capture around the expression containing $0, then the code works:

    $_="bbaaaaawer"; / (.) $0**2..* / && print $/; # aaaaa
    

    Then you can also omit the {}. (I'll return to why you sometimes need to insert a {} later in this answer.)


    But perhaps you wrote a sub-capture around the expression containing the back reference because you thought you needed the sub-capture for some other later processing.

    There are often other ways to do things. In your example, perhaps you wanted a way to be able to count the number of repeats. If so, you could instead write:

    $_="bbaaaaawer";
    / (.) $0**2..* /;
    print $/.chars div $0.chars; # 5
    

    Job done, without the complications of the following sections.

    Option #2: Sub-capture without changing the current match object during matching of the pattern that includes a back reference

    Maybe you really need to sub-capture a match of an expression that includes a back reference.

    This can still be done without surrounding the $0 with a parenthesized sub-capture. The alternative technique shown in this section avoids the problems discussed in the sections above and below.

    You can use this technique if the expression isn't too complicated or if you don't need to have named sub-sub-captures of the expression:

    $_="bbaaaaawer";
    / (.) $<capture-when-done>=$0**2..* /;
    print $<capture-when-done>.join; # aaaa
    

    This sub-captures the result of matching the expression (in a named capture) but avoids inserting an additional sub-capturing context around the expression (which is what causes the complications discussed in the previous and next sections).

    While this technique works for the expression in your question ($0**2..*), it won't work if an expression is complex enough to need grouping and you also need named sub-captures inside it. This is because the syntax $<foo>=[...] disables any further named sub-sub-capturing (within the ...).

    Perhaps this is fixable without hurting performance or causing other problems (see footnote 2).

    Option #3: Use a saved back reference inside a sub-capture

    Finally we arrive at the technique you've used in your question.

    Automatically available back references to sub-captures (like $0) cannot refer to sub-captures that happened outside the sub-capture they're written in. Update: see the "I'm (at least half) wrong!" note below.

    So if, for any reason, you have to create a sub-capture (using either (...) or <...>) then you must manually store a back reference in a variable and use that instead.

    Before we get to a final section explaining in detail why you must use a variable, let's first complete an initial answer to your question by covering the final wrinkle.

    {} forces "publication" of match results thus far

    With the current regex/grammar engine, the {} is necessary to make :my $c=$0; pick up the latest capture each time it's reached. If you don't write it, the regex engine fails to update $c to a capture of 'a' and instead leaves it stuck on a capture of 'b'.

    Please read "Publication" of match variables by Rakudo.
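    As a rough sketch, here is the regex from the question laid out over several lines with one comment per piece. The names $str and $matched are only illustrative, and the expected output is the one reported in the question:

```raku
my $str = "bbaaaaawer";

my $matched = $str ~~ /
    (                  # outer capture group: it opens a new match level
        (.)            # inner capture: $0 relative to the outer group
        {}             # empty block: publish the captures made so far
        :my $c = $0;   # save the inner capture in a regular variable
        ($c)**2..*     # two or more further copies of the saved character
    )
/;

say ~$matched[0] if $matched;   # aaaaa
```

    Seen from outside the regex, the outer group is the top-level $0, which is why the question's one-liner prints $0.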

    Why can't a sub-capture include a back reference to captures that happened outside that sub-capture?

    First, you have to take into account that matching in P6 is optimized for the nested matching case syntactically, semantically, and implementation-wise.

    In particular, if, when writing a regex or grammar, you write a numbered capture (with (...)), or a named rule/capture (with <foo>), then you've inserted a new level in a tree of sub-patterns that are dynamically matched/captured at run-time.

    See jnthn's answer for why and Brad's for some discussion of details.


    What I'll add to those answers is a (rough!) analogy, and another discussion of why you have to use a variable and {}.

    The analogy begins with a tree of sub-directories in a file system:

    /
      a
      b
        c
        d
    

    The analogy is such that:

    If file system navigation didn't support these back references towards the root then one thing to do would be to create an environment variable that stored a particular path. That's roughly what saving a capture in a variable in a P6 regex is doing.

    The central issue is that a lot of the machinery related to regexes is relative to "the current match". This includes $/, which refers to the current match, and back references like $0, which are relative to the current match. Update: see the "I'm (at least half) wrong!" note above.


    Thus, in the following, which is runnable on tio.run, it's easy to display 'bc' or 'c' with a code block inserted in the third pair of parens...

    $_="abcd";
    m/ ( ( . ) ( . ( . ) { say $/ } ( . ) ) ) /; # 「bc」␤ 0 => 「c」␤
    say $/;                                      # 「abcd」␤ etc.
    

    ...but it's impossible to refer to the captured 「a」 in that third pair of parens without storing 「a」's capture in a regular variable. Update: see the "I'm (at least half) wrong!" note above.

    Here's one way of looking at the above match:

      ↓ Start TOP level $/
    m/ ( ( . ) ( . ( . ) { say $/ } ( . ) ) ) /; # captures 「abcd」
    
        ↓ Start first sub-capture; TOP's $/[0]
       (                                    )    # captures 「abcd」
    
          ↓ Start first sub-sub-capture; TOP's $/[0][0]
         ( . )                                   # captures 「a」
        
                ↓ Start *second* sub-sub-capture; TOP's $/[0][1]
               (                          )      # captures 「bcd」
    
                    ↓ Start sub-sub-sub-capture; TOP's $/[0][1][0]
                   ( . )                         # captures 「c」
    
                         { say $/ }              # 「bc」␤ 0 => 「c」␤
    
                                     ( . )       # captures 'd'
    

    If we focus for a moment on what $/ refers to outside of the regex (and also directly inside the /.../ regex, but not inside sub-captures), then that $/ refers to the overall Match object, which ends up capturing 「abcd」. (In the filesystem analogy this particular $/ is the root directory.)

    The $/ inside the code block inside the second sub-sub-capture refers to a lower level match object, specifically the one that, at the point the say $/ is executed, has already matched 「bc」 and will go on to have captured 「bcd」 by the end of the overall match.

    But there's no built-in way to refer to the sub-capture of 'a', or the overall capture (which at that point would be 'abc'), from within the sub-capture surrounding the code block. Update: see the "I'm (at least half) wrong!" note above.

    Hence you have to do something like what you've done.
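    The index paths in the diagram above can be checked once the match has completed. This is a minimal sketch that reuses the same pattern without the code block. After the match, $/ is the top-level match object, so every capture is reachable by its path from the root:

```raku
'abcd' ~~ m/ ( ( . ) ( . ( . ) ( . ) ) ) /;

say ~$/;            # abcd  (the root)
say ~$/[0];         # abcd  (first sub-capture)
say ~$/[0][0];      # a     (first sub-sub-capture)
say ~$/[0][1];      # bcd   (second sub-sub-capture)
say ~$/[0][1][0];   # c
say ~$/[0][1][1];   # d
```

    These paths are all relative to the root, which is why they are convenient after matching but do not help from inside a nested capture while the match is still in progress.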

    A possible improvement?

    What if there were a direct analog in P6 regexes for specifying the root? Update: see the "I'm (at least half) wrong!" note above.

    Here's an initial cut at this that might make sense. Let's define a grammar:

    my $*TOP;
    grammar g {
      token TOP { { $*TOP := $/ } (.) {} <foo> }
      token foo { <{$*TOP[0]}> }
    }
    say g.parse: 'aa' # 「aa」␤ 0 => 「a」␤ foo => 「a」
    

    So, perhaps a new variable could be introduced, one that's read-only for userland code and bound to the overall match object during a match operation. Update: see the "I'm (at least half) wrong!" note above.

    But then that's not only pretty ugly (you can't use a convenient shorthand back reference like $0) but also refocuses attention on the need to insert a {}. And given that it would presumably be absurdly expensive to republish the whole tree of match objects after each atom, one is brought full circle back to the current status quo. Short of the fixes mentioned in this answer, I think what is currently implemented is as good as it's likely to get.

    Footnotes

    1 The current P6 doc doesn't use the conventional regex term "back reference" but $0, $1 etc. are numbered P6 back references. The simplest explanation I've seen of numbered back references is this SO about them using a different regex dialect. In P6 they start with $ instead of \ and are numbered starting from 0 rather than 1. The equivalent of \0 in other regex dialects is $/ in P6. In addition, $0 is an alias for $/[0], $1 for $/[1], etc.
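    A small sketch of the aliasing described in footnote 1, using a made-up input string. The prefix ~ stringifies a match object:

```raku
'abc' ~~ / (.) (.) /;

say ~$/;      # ab  (the whole match)
say ~$0;      # a
say ~$/[0];   # a   ($0 is shorthand for $/[0])
say ~$1;      # b
say ~$/[1];   # b   ($1 is shorthand for $/[1])
```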

    2 The $<bar>=wer below will match but fail to capture:

    $_="bbaaaaawer";
    / (.) $<foo>=[$0**2..* $<bar>=wer] /;
    say $<foo>;        # 「aaaawer」
    say $<foo><bar>;   # Nil
    

    It seems that [...] doesn't mean "group, but don't insert a new capture level like (...) and <...> do" but instead "group, and disable named sub-capturing using the $<name>=... syntax". Perhaps this can reasonably be fixed and perhaps it should be fixed.

    3 The current "match variable" doc says:

    $/ is the match variable. It stores the result of the last Regex match and so usually contains objects of type Match.

    (Fwiw $/ contains a List of Match objects if an adverb like :global or :exhaustive is used.)

    The above description ignores a very important use case for $/ which is its use during matching, in which case it contains the results so far of the current regex.

    Following our file system analogy, $/ is like the current working directory -- let's call it "the current working match object" aka CWMO. Outside a matching operation the CWMO ($/) is ordinarily the completed result of the last regex match or grammar parse. (I say "ordinarily" because it's writable so code can change it with as little as $/ = 42.) During matching (or actions) operations the CWMO is read-only for userland code and is bound to a Match object generated by the regex/grammar engine for the current match or action rule/method.
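    To see $/ acting as the "current working match object" during matching, a code block inside a regex can record what $/ holds at that point. This is only an illustrative sketch; @seen is a made-up name:

```raku
my @seen;

'abc' ~~ / (.) { @seen.push(~$/) } (.) { @seen.push(~$/) } /;

say @seen;   # [a ab]  (what $/ had matched at each code block)
say ~$/;     # ab      (the completed match, once matching is over)
```

    Each block sees only what has been matched up to that point, whereas the $/ read after the statement is the finished result.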