regex pcre

How to keep matched groups of pattern recursion in regex?


I'm trying to validate an input using a regex. The input must follow a specific programming-language-like syntax made up of functions, variables (of two types), and strings.

Here are some examples:

&foo
$foo(&bar)
$foo(&bar+$baz)
$foo($bar "baz" qux+$quux(&corge "grault") &garply)

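Types that exist:

  • function: $name(arguments), with arguments separated by + or a single space
  • variable of the first type: $name
  • variable of the second type: &name
  • string: either "quoted text" or a bare literal such as qux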

The main problem is this: I've written a regex (with recursion, of course), but I also need to validate that the final match, either directly or within its recursions, contains at least one variable of the second type, which it currently does not do. The regex must also full-match the input.

So here's my regex (it uses PCRE2), followed by its explanation and a suite of tests for you to try it:

Regex

(?:\$\w+(?:\((?<arg>(?R)|(?:\"[^\"]*\")|(?:[^\$\"\&\s\(\)][^\s\(\)]*))(?:(?:\+| )(?&arg))*\))?)|(?<var>\&\w+)

Explanations

(?:                           # function or variable and its arguments
  \$\w+                         # function or variable prefix + name (ex: $test)
  (?:                           # arguments if it's a function
    \(                            # opening parenthesis
    (?<arg>                       # first argument
      (?R)                          # function or variable
      |
      (?:\"[^\"]*\")                # string
      |
      (?:[^\$\"\&\s\(\)][^\s\(\)]*) # string literal
    )
    (?:                           # other arguments if any
      (?:\+| )                      # separator
      (?&arg)                       # argument
    )*
    \)                            # closing parenthesis
  )?
)
|
(?<var>\&\w+)                 # variable prefix + name (ex: &test)

Human-readable translation

(                         # function or variable and its arguments
  \$\w+                     # function or variable prefix + name (ex: $test)
  MAYBE (                   # arguments if it's a function
    \(                        # opening parenthesis
    GROUP <arg> (             # first argument
      RECURSIVE                 # function or variable
      OR
      \"[^\"]*\"                # string
      OR
      [^\$\"\&\s\(\)][^\s\(\)]* # string literal
    )
    MAYBE MULTIPLE (          # other arguments if any
      \+ OR SPACE               # separator
      GROUP arg                 # argument
    )
    \)                        # closing parenthesis
  )
)
OR
GROUP <var> (\&\w+)       # variable prefix + name (ex: &test)

The goal is to change it so that it verifies that the group var matched at least once, either in the top-level pattern or in one of its recursions.

The best solution would be for (?R) to keep the groups matched within a recursion and pass them up to the parent pattern. I would then be able to check whether the group var matched at least once with (?(var)(*ACCEPT)|(*FAIL)).

Here's a simplified version of what I'm thinking about: (?:\w(?R)|(\d))(?(1)(*ACCEPT)|(*FAIL)). This regex only matches a single digit, but if matches were kept across recursion, it would match any chain of word characters ending in a digit.

However, this does not seem to be possible at all. I didn't find a flag or a token for it.
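To make the wished-for behaviour concrete, here is a minimal plain-Python sketch of the simplified pattern in which the inner "capture" is handed back to the caller. It is only an illustration of the idea, not how any regex engine works; the function name match_chain and its return convention are made up for this example.

```python
import string

WORD_CHARS = string.ascii_letters + string.digits + "_"  # ASCII \w


def match_chain(text, pos=0):
    # Emulates (?:\w(?R)|(\d)), except that the digit captured in a
    # recursive call is returned to the parent instead of being lost.
    # Returns (end_position, captured_digit) on success, None on failure.
    if pos >= len(text):
        return None
    ch = text[pos]
    # First alternative: one word character, then recurse.
    if ch in WORD_CHARS:
        inner = match_chain(text, pos + 1)
        if inner is not None:
            return inner  # the inner capture travels up unchanged
    # Second alternative: a single digit, captured as "group 1".
    if ch in string.digits:
        return pos + 1, ch
    return None
```

With this, match_chain("ab1") succeeds and reports the digit captured at the deepest level, while match_chain("abc") fails because no level ever captures a digit.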

Test/Examples

$test($test(test &var1 $var2) $test(jean) jean) # should full-match
$test($test(test $var1 $var2) $test(foo) bar)   # should not full-match
$test("&test")                                  # should not full-match
$test("foo&test")                               # should not full-match
$test("&test" &test)                            # should full-match
&test                                           # should full-match
$test(" &test")                                 # should not full-match
$test                                           # should not full-match
&test(foo)                                      # should not full-match
&test(&foo $bar())                              # should not full-match
$test((&test))                                  # should not full-match
$foo(bar(&baz))                                 # should not full-match
$test(&test  &test)                             # should not full-match
$test( &test)                                   # should not full-match
$test(&test )                                   # should not full-match

Here's a link to my regex101 tests for you to try and test it.


Solution

  • After some searching, I came up with a library-specific solution rather than a pure-regex one. Sorry if you have the same problem and can't use it!

    For more context, I was using Python with the great mrab-regex (or simply regex) library (and NOT re, which doesn't support recursive regexes, among a few other things).

    This library has a simple method, captures, that returns a list of all the successful matches of a repeated group. I simply call it right after my regex match to check whether the capturing group var is empty:

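    import regex  # third-party mrab-regex package, not the stdlib re

    # my_regex is the pattern above compiled with regex.compile(); cell is the input text
    for match in my_regex.finditer(cell):
        if not match.captures('var'):  # the group var never captured: skip this match
            continue
        ...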
    

    More information about the mrab-regex library and its methods for handling repeated captures.
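If the regex library is not available, the same check can be done without any regex recursion. The following is only a sketch of a different technique, a small hand-written backtracking parser that follows the grammar described in the question. All names in it (expr_ends, arg_ends, args_ends, is_valid) are hypothetical. It also makes one assumption that goes beyond the question: an input is accepted if any valid way of reading it contains a variable of the second type, so a bare literal such as a+&b may be read as a, +, &b.

```python
import re

# re.ASCII mirrors PCRE2's default ASCII-only \w and \s.
NAME = re.compile(r"\w+", re.ASCII)
STRING = re.compile(r'"[^"]*"')
LITERAL_HEAD = re.compile(r'[^$"&\s()]', re.ASCII)
LITERAL_TAIL = re.compile(r"[^\s()]*", re.ASCII)


def expr_ends(s, i):
    """Yield (end, has_amp) for each way a $function, $variable or &variable matches at s[i:]."""
    if i >= len(s) or s[i] not in "$&":
        return
    name = NAME.match(s, i + 1)
    if name is None:
        return
    if s[i] == "&":
        yield name.end(), True  # variable of the second type
        return
    if s.startswith("(", name.end()):  # $name followed by an argument list
        for end, has_amp in args_ends(s, name.end() + 1):
            if s.startswith(")", end):
                yield end + 1, has_amp
    yield name.end(), False  # bare $name


def arg_ends(s, i):
    """Yield (end, has_amp) for one argument: expression, quoted string, or literal."""
    yield from expr_ends(s, i)
    quoted = STRING.match(s, i)
    if quoted:
        yield quoted.end(), False
    if LITERAL_HEAD.match(s, i):
        longest = LITERAL_TAIL.match(s, i + 1).end()
        for end in range(longest, i, -1):  # longest first, then backtrack
            yield end, False


def args_ends(s, i):
    """Yield (end, has_amp) for one or more arguments separated by '+' or one space."""
    for end, has_amp in arg_ends(s, i):
        if end < len(s) and s[end] in "+ ":
            for end2, more in args_ends(s, end + 1):
                yield end2, has_amp or more
        yield end, has_amp


def is_valid(s):
    """True if s fully matches the grammar and contains at least one &variable."""
    return any(end == len(s) and has_amp for end, has_amp in expr_ends(s, 0))
```

For example, is_valid('$test("&test" &test)') is True, while is_valid('$test("&test")') is False because the only &test sits inside a quoted string. The generators try every possible split, so this is fine for short inputs like spreadsheet cells but is not tuned for long ones.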