[SOLVED] Raku regex: Inconsistent longest token matching

Raku's regexes are expected to match the longest token.

And in fact, this behaviour is seen in this code:

raku -e "'AA' ~~ m/A {say 1}|AA {say 2}/"
# 2

However, when the text is in a variable, it does not seem to work in the same way:

raku -e "my $a = 'A'; my $b = 'AA'; 'AA' ~~ m/$a {say 1}|$b {say 2}/"
# 1

Why do the two behave differently? Is there a way to use variables and still match the longest token?

Solution

There are two things at work here.

The first is the meaning of "longest token". When there is an alternation (using | or implied by use of proto regexes), the declarative prefix of each branch is extracted. Declarative means the subset of the Raku regex language that can be matched by a finite state machine. The declarative prefix is determined by taking regex elements until a non-declarative element is encountered. You can read more and find some further references in the docs.
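As a small illustration of the difference between longest-token alternation (|) and sequential, first-match alternation (||), here is a sketch that uses only literal branches, which are fully declarative. The variable names are just for illustration.

```raku
# With |, the branch with the longest declarative prefix that matches wins,
# regardless of the order in which the branches are written.
my $longest = 'food' ~~ / f | fo | foo | food /;
say ~$longest;   # food

# With ||, branches are tried left to right and the first match wins.
my $first = 'food' ~~ / f || fo || foo || food /;
say ~$first;     # f
```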

To understand why things are this way, a small detour may be helpful. A common approach to building parsers is to write a tokenizer, which breaks the input text up into a sequence of "tokens", and then a parser that identifies larger (and perhaps recursive) structure from those tokens. Tokenizing is typically performed using a finite state machine, since it is able to rapidly cut down the search space. With Raku grammars, we don't write the tokenizer ourselves; instead, it's automatically extracted from the grammar for us (more precisely, a tokenizer is calculated per alternation point).
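To make the "implied by use of proto regexes" case concrete, here is a minimal sketch of a grammar where the choice between two tokens is made by longest-token matching rather than by declaration order. The grammar name Ops and its tokens are hypothetical, chosen only for this example.

```raku
grammar Ops {
    token TOP { <op> }

    # The proto acts as an alternation point over all op:sym<...> candidates.
    proto token op {*}
    token op:sym<+>  { <sym> }   # declared first, but shorter
    token op:sym<++> { <sym> }   # declared second, but longer
}

# subparse does not require the whole input to be consumed, so the result
# shows which candidate was actually preferred.
my $m = Ops.subparse('++');
say ~$m<op>;   # ++
```

Both candidates are plain literals, so each is entirely declarative and the longer one is chosen.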

Secondly, Raku regexes are a nested language within the main Raku language, parsed in a single pass with it and compiled at the same time. (This is a departure from most languages, where regexes are provided as a library that we pass strings to.) The longest token calculation takes place at compile time, whereas variables are interpolated at runtime. A variable interpolation in a regex is therefore non-declarative and is not considered as part of longest token matching. In the second example each branch begins with a variable, so both declarative prefixes are empty, neither is longer than the other, and the first branch is tried first; that is why it prints 1.
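On the second part of the question: one option, as far as the Raku regex documentation describes it, is to interpolate an array instead of separate scalars. The elements are then treated like a | alternation and the longest matching element succeeds, with the choice made at runtime. This is a minimal sketch under that assumption; @tokens is an illustrative name, and this form does not let you attach a separate code block to each alternative.

```raku
my $a = 'A';
my $b = 'AA';

# The shorter string comes first, so a first-match strategy would pick 'A'.
my @tokens = $a, $b;

my $m = 'AA' ~~ / @tokens /;
say ~$m;   # AA
```

To find out which variable matched, compare the matched text against the candidates after the match.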