
Capturing text before and after a C-style code block with a Perl regular expression


I am trying to capture some text before and after a C-style code block using a Perl regular expression. So far this is what I have:

use strict;
use warnings;

my $text = << "END";
int max(int x, int y)
{
    if (x > y)
    {
        return x;
    }
    else
    {
        return y;
    }
}
// more stuff to capture
END

# Regex to match a code block
my $code_block = qr/(?&block)
(?(DEFINE)
    (?<block>
        \{                # Match opening brace
            (?:           # Start non-capturing group
                [^{}]++   #     Match non-brace characters without backtracking
                |         #     or
                (?&block) #     Recurse into the named group "block"
            )*            # Match 0 or more times
        \}                # Match closing brace
    )
)/x;

# $2 ends up undefined after the match
if ($text =~ m/(.+?)$code_block(.+)/s){
    print $1;
    print $2;
}

I am having an issue with the 2nd capture group not being initialized after the match. Is there no way to continue a regular expression after a DEFINE block? I would think that this should work fine.

I expect $2 to contain the comment below the block of code, but it is undefined, and I can't find a good reason why.


Solution

  • Capture groups are numbered left to right by the position of their opening parenthesis in the regex, not in the order in which they match. Here is a simplified view of your regex:

    m/
      (.+?)  # group 1
      (?:  # the $code_block regex
        (?&block)
        (?(DEFINE)
          (?<block> ... )  # group 2
        )
      )
      (.+)  # group 3
    /xs
    

    Named groups can also be accessed as numbered groups.

    The second group is the block group. However, it sits inside (?(DEFINE)...), so it is never matched in place. It is only invoked as a named subpattern through (?&block), and captures set during such a call are not visible once the call returns. As such, $2 is undef.

    As a consequence, the text after the code block is stored in $3.
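
    The numbering described above can be checked with a minimal sketch, assuming Perl 5.10 or later. The sample text and the variable names $before, $block_capture, and $after are made up for illustration; the pattern is the one from the question, condensed.

    ```perl
    use strict;
    use warnings;

    # Same recursive pattern as in the question, condensed.
    my $code_block = qr/(?&block)
    (?(DEFINE)
        (?<block>                 # group 2: only ever entered via (?&block)
            \{
                (?: [^{}]++ | (?&block) )*
            \}
        )
    )/x;

    # Hypothetical sample input, shorter than the one in the question.
    my $text = "int f()\n{ if (x) { return 1; } }\n// trailing comment\n";

    my ($before, $block_capture, $after);
    if ($text =~ m/(.+?)$code_block(.+)/s) {
        # Copy the captures right away so later matches cannot clobber them.
        ($before, $block_capture, $after) = ($1, $2, $3);
    }

    print "group 1: $before";
    print "group 2 is ", (defined $block_capture ? "defined" : "undef"), "\n";
    print "group 3: $after";
    ```

    If the reasoning above holds, group 2 is reported as undef and the trailing comment shows up in group 3.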

    There are two ways to deal with this problem: