I am using Perl 5.34.1 and Python 3.10.12.
The following script takes 16 seconds in Python:
import re
patt = re.compile(r"W(X|Y+)+Z")
print(patt.search("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA"))
But takes almost no time at all in Perl:
use v5.34;
my $patt = qr/W(X|Y+)+Z/;
print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
This is an example of catastrophic backtracking provided by https://medium.com/bigpanda-engineering/catastrophic-backtracking-the-dark-side-of-regular-expressions-80cab9c443f6.
I would expect that any backtracking regex engine would suffer in this case, but apparently the Perl regex engine does not. How does it avoid the catastrophic backtracking in this example? Is it actually backtracking "catastrophically" internally, but orders of magnitude faster than the Python engine? Is the some fast-path optimization it uses to prevent fail early before completing the full backtracking sequence?
Because of optimizations. It realizes a Z
is needed, but it doesn't find one.
$ perl -Mre=debug -e'
use v5.34;
my $patt = qr/W(X|Y+)+Z/;
print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
'
Compiling REx "W(X|Y+)+Z"
Final program:
1: EXACT <W> (3)
3: CURLYX<1:1>[0]{1,INFTY} (21)
7: OPEN1 (9)
9: BRANCH (buf:1/1) (13)
11: EXACT <X> (18)
13: BRANCH (buf:1/1) (FAIL)
15: PLUS (18)
16: EXACT <Y> (0)
18: CLOSE1 (20)
20: WHILEM[1/1] (0)
21: NOTHING (22)
22: EXACT <Z> (24)
24: END (0)
anchored "W" at 0..0 floating "Z" at 2..9223372036854775807 (checking floating) minlen 3
Matching REx "W(X|Y+)+Z" against "WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [2..30] gave -1
Did not find floating substr "Z"...
Match rejected by optimizer
Freeing REx: "W(X|Y+)+Z"
That said, using qr/W(X|Y+)+(Z|!)/
bypasses that optimization.
$ perl -Mre=debug -e'
use v5.34;
my $patt = qr/W(X|Y+)+Z/;
print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
' 2>&1 | wc -l
22
$ perl -Mre=debug -e'
use v5.34;
my $patt = qr/W(X|Y+)+(Z|!)/;
print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
' 2>&1 | wc -l
2971
Still, the difference is less than 1 ms on my machine.
$ time perl -e'
use v5.34;
my $patt = qr/W(X|Y+)+Z/;
print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
' 2>&1 | wc -l
0
real 0m0.002s
user 0m0.002s
sys 0m0.000s
$ time perl -e'
use v5.34;
my $patt = qr/W(X|Y+)+(Z|!)/;
print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
' 2>&1 | wc -l
0
real 0m0.002s
user 0m0.002s
sys 0m0.000s
Obviously, there are other optimizations. -Mre=debug
with this version includes this message:
WHILEM: Detected a super-linear match, switching on caching...
You could try looking if Yves "demerphq" Orton has done any talks on this.