I'm trying to validate an input using a regex. The input must follow a specific programming-language-like syntax that consists of functions, variables (2 types) and strings.
Here are some examples:
&foo
$foo(&bar)
$foo(&bar+$baz)
$foo($bar "baz" qux+$quux(&corge "grault") &garply)
Types that exist:
The main problem is that I've written a regex, with recursion of course, but I also need to validate that the final match itself, or one of its recursions, contains at least one variable of the second type, which my regex doesn't currently check. Obviously, my regex should full-match the input.
So here's my regex (it uses PCRE2), followed by its explanation and a suite of tests for you to try it:
(?:\$\w+(?:\((?<arg>(?R)|(?:\"[^\"]*\")|(?:[^\$\"\&\s\(\)][^\s\(\)]*))(?:(?:\+| )(?&arg))*\))?)|(?<var>\&\w+)
(?: # function or variable and its arguments
\$\w+ # function or variable prefix + name (ex: $test)
(?: # arguments if it's a function
\( # opening parenthesis
(?<arg> # first argument
(?R) # function or variable
|
(?:\"[^\"]*\") # string
|
(?:[^\$\"\&\s\(\)][^\s\(\)]*) # string literal
)
(?: # other arguments if any
(?:\+| ) # separator
(?&arg) # argument
)*
\) # closing parenthesis
)?
)
|
(?<var>\&\w+) # variable prefix + name (ex: &test)
( # function or variable and its arguments
\$\w+ # function or variable prefix + name (ex: $test)
MAYBE ( # arguments if it's a function
\( # opening parenthesis
GROUP <arg> ( # first argument
RECURSIVE # function or variable
OR
\"[^\"]*\" # string
OR
[^\$\"\&\s\(\)][^\s\(\)]* # string literal
)
MAYBE MULTIPLE ( # other arguments if any
\+ OR SPACE # separator
GROUP arg # argument
)
\) # closing parenthesis
)
)
OR
GROUP <var> (\&\w+) # variable prefix + name (ex: &test)
The goal is to transform it to verify that the group var is present at least once, within the regex or its recursions.
The best solution would be for (?R) to keep the groups matched within a recursion and pass them to its parent pattern; I would then be able to check whether the group var matched at least once with (?(var)(*ACCEPT)|(*FAIL)).
Here's a simplified version of what I'm thinking about: (?:\w(?R)|(\d))(?(1)(*ACCEPT)|(*FAIL)). This regex only matches 1 digit, but if captures made inside a recursion were kept, it would match any chain of word characters followed by a digit.
However, it seems that this isn't possible at all: I didn't find a flag or a token for it.
$test($test(test &var1 $var2) $test(jean) jean) # should full-match
$test($test(test $var1 $var2) $test(foo) bar) # should not full-match
$test("&test") # should not full-match
$test("foo&test") # should not full-match
$test("&test" &test) # should full-match
&test # should full-match
$test(" &test") # should not full-match
$test # should not full-match
&test(foo) # should not full-match
&test(&foo $bar()) # should not full-match
$test((&test)) # should not full-match
$foo(bar(&baz)) # should not full match
$test(&test  &test) # should not full match
$test( &test) # should not full match
$test(&test ) # should not full match
Here's a link to my regex101 tests for you to try and test it.
So, after searching, I came up with a library-specific solution rather than a pure regex one; sorry if you have the same problem!
For more context, I was using Python with the great mrab-regex (or simply regex) library, and NOT re, which doesn't support recursive regexes, among a few other things.
This library has a simple method, captures, that returns a list of all the successful matches of a repeated group. I simply called it right after my regex match to check whether the capturing group var was empty:
for match in my_regex.finditer(cell):
    # skip any match in which the group 'var' never captured anything
    if not match.captures('var'):
        continue
    ...
See the mrab-regex library documentation for more information about its methods for handling repeated captures.
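For anyone who can't use that library, here is a minimal sketch of a different technique: a small hand-written backtracking parser in plain Python instead of a recursive regex. It assumes the grammar is exactly the one spelled out in the regex explanation above ($name with optional parenthesised arguments, &name, quoted strings, bare string literals, and a single + or space as separator). The function names (parse_expr, parse_arg, parse_args, is_valid) are made up for illustration and are not part of any library.

```python
import re
from typing import Iterator, Tuple

NAME = re.compile(r"\w+")
QUOTED = re.compile(r'"[^"]*"')
BARE = re.compile(r'[^$"&\s()][^\s()]*')

# (end position, whether an &variable was seen along the way)
Parse = Tuple[int, bool]


def parse_expr(s: str, i: int) -> Iterator[Parse]:
    """Yield every way to read a $function/$variable or an &variable at i."""
    head = s[i:i + 1]
    m = NAME.match(s, i + 1) if head in ("$", "&") else None
    if m is None:
        return
    if head == "&":
        yield m.end(), True
        return
    # $name followed by a parenthesised, non-empty argument list
    if s[m.end():m.end() + 1] == "(":
        for end, seen in parse_args(s, m.end() + 1):
            if s[end:end + 1] == ")":
                yield end + 1, seen
    # plain $name without arguments
    yield m.end(), False


def parse_arg(s: str, i: int) -> Iterator[Parse]:
    """Yield every way to read one argument starting at i."""
    yield from parse_expr(s, i)
    m = QUOTED.match(s, i)
    if m:
        yield m.end(), False
    m = BARE.match(s, i)
    if m:
        # a bare literal may contain '+', so try the longest form first
        # and then every shorter prefix, like regex backtracking would
        for end in range(m.end(), i, -1):
            yield end, False


def parse_args(s: str, i: int) -> Iterator[Parse]:
    """Yield every way to read one or more separated arguments at i."""
    for end, seen in parse_arg(s, i):
        yield end, seen
        if s[end:end + 1] in ("+", " "):
            for end2, seen2 in parse_args(s, end + 1):
                yield end2, seen or seen2


def is_valid(s: str) -> bool:
    """True if s parses completely and contains at least one &variable."""
    return any(end == len(s) and seen for end, seen in parse_expr(s, 0))


print(is_valid('$test("&test" &test)'))
print(is_valid('$test("&test")'))
```

Because every parse result carries a flag saying whether an &variable was seen, the "at least one var" requirement becomes a plain boolean check at the very end. The backtracking over + inside bare literals is naive, so this could get slow on long inputs with many + signs.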