The Best Regex Trick is about writing regexes that match r1
but not r2
. The example they give is a regex that matches Tarzan
(and "Tarzan and Jane"
) but not "Tarzan"
. After going through some things that don't work, they give the "best regex trick ever":
"Tarzan"|(Tarzan)
This supposedly matches the "bad string" first, skipping the good string, but not including the bad string in a capture group. If only the good string appears, we match on it last and include it in the capture group.
One downside of the "best regex trick" is that this still matches "Tarzan"
, even if it doesn't capture it. You can't eg use it in a conditional without some extra boilerplate?
This is based on PCRE-style regexes. Raku uses an entirely different regex notation. Is it possible to do the trick more simply? Ideally this should be possible:
> ('"Tarzan"', 'Tarzan', '"Tarzan and Jane"') <<~~>> /some-regex/
(Nil 「Tarzan」 「Tarzan」)
say (« '"Tarzan"' Tarzan '"Tarzan and Jane"' » «~~» /'"Tarzan"' | (Tarzan)/)»[0]
# (Nil 「Tarzan」 「Tarzan」)
Discussion of the code, reading from right to left:
What is [0]
? A list "subscript". It's a reference to the zeroth element of the list on its left. Unlike in older regex dialects, in Raku, positional captures are numbered from zero, not one. So the captures associated with the (Tarzan)
positional sub-capture in the regex are stored as positional sub-capture zero of an overall match instead of positional sub-capture one.
What is »[0]
? A hyperoperation. For our purposes here it's like running a traditional map
function that, for each element on the left (which will be an overall regex match of one of the strings on the far left), yields the zeroth positional sub-element (which will be the zeroth, i.e. first, positional sub capture).
Why have I written '"Tarzan"'
in the regex instead of "Tarzan"
? Unlike in older regex dialects, _in Raku, code of the form "foo"
in a regex represents the three character string foo
, not the five character string "foo"
. One way to represent the latter is to write '"foo"'
.
What is «~~»
? Another hyperoperation, like the postfix/unary »[0]
one described above, but this time an infix/binary one with operands on both sides. (The original question included the same hyperoperation.)
What is « '"Tarzan"' Tarzan '"Tarzan and Jane"' »
? A list of strings. Raku has a family of list-of-strings constructors; « ... »
in term position (as against operator position) is its "word" quoting with quote protection list constructor.
is it possible to do the trick more simply?
A step that's arguably in the right direction is:
say « '"Tarzan"' Tarzan '"Tarzan and Jane"' » «~~» / '"Tarzan"' <()> | <(Tarzan)> /
# (「」 「Tarzan」 「Tarzan」)
Notes:
<(...)>
explicitly delimits the top level capture, narrowing it from the default top level capture (which captures everything that matches).<()>
means the top level capture is empty. That's not Nil
, but could suffice depending on your needs.<(Tarzan)>
specifies the top level capture (in contrast to (Tarzan)
which specifies the first sub-capture).The original specification for Raku regexes included some relevant features that have not yet been implemented. It's possible they may one day be implemented and provide a simpler solution.
See also @tshiono's answer.
See also @alatennaub's excellent answer on reddit.