regexperl

How does Perl apparently avoid catastrophic backtracking?


I am using Perl 5.34.1 and Python 3.10.12.

The following script takes 16 seconds in Python:

import re
patt = re.compile(r"W(X|Y+)+Z")
print(patt.search("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA"))

But takes almost no time at all in Perl:

use v5.34;
my $patt = qr/W(X|Y+)+Z/;
print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);

This is an example of catastrophic backtracking provided by https://medium.com/bigpanda-engineering/catastrophic-backtracking-the-dark-side-of-regular-expressions-80cab9c443f6.

I would expect that any backtracking regex engine would suffer in this case, but apparently the Perl regex engine does not. How does it avoid the catastrophic backtracking in this example? Is it actually backtracking "catastrophically" internally, but orders of magnitude faster than the Python engine? Is the some fast-path optimization it uses to prevent fail early before completing the full backtracking sequence?


Solution

  • Because of optimizations. It realizes a Z is needed, but it doesn't find one.

    $ perl -Mre=debug -e'
        use v5.34;
        my $patt = qr/W(X|Y+)+Z/;
        print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
    '
    Compiling REx "W(X|Y+)+Z"
    Final program:
       1: EXACT <W> (3)
       3: CURLYX<1:1>[0]{1,INFTY} (21)
       7:   OPEN1 (9)
       9:     BRANCH (buf:1/1) (13)
      11:       EXACT <X> (18)
      13:     BRANCH (buf:1/1) (FAIL)
      15:       PLUS (18)
      16:         EXACT <Y> (0)
      18:   CLOSE1 (20)
      20: WHILEM[1/1] (0)
      21: NOTHING (22)
      22: EXACT <Z> (24)
      24: END (0)
    anchored "W" at 0..0 floating "Z" at 2..9223372036854775807 (checking floating) minlen 3
    Matching REx "W(X|Y+)+Z" against "WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA"
    Intuit: trying to determine minimum start position...
      doing 'check' fbm scan, [2..30] gave -1
      Did not find floating substr "Z"...
    Match rejected by optimizer
    Freeing REx: "W(X|Y+)+Z"
    

    That said, using qr/W(X|Y+)+(Z|!)/ bypasses that optimization.

    $ perl -Mre=debug -e'
        use v5.34;
        my $patt = qr/W(X|Y+)+Z/;
        print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
    ' 2>&1 | wc -l
    22
    
    $ perl -Mre=debug -e'
        use v5.34;
        my $patt = qr/W(X|Y+)+(Z|!)/;
        print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
    ' 2>&1 | wc -l
    2971
    

    Still, the difference is less than 1 ms on my machine.

    $ time perl -e'
        use v5.34;
        my $patt = qr/W(X|Y+)+Z/;
        print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
    ' 2>&1 | wc -l
    0
    
    real    0m0.002s
    user    0m0.002s
    sys     0m0.000s
    
    $ time perl -e'
        use v5.34;
        my $patt = qr/W(X|Y+)+(Z|!)/;
        print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
    ' 2>&1 | wc -l
    0
    
    real    0m0.002s
    user    0m0.002s
    sys     0m0.000s
    

    Obviously, there are other optimizations. -Mre=debug with this version includes this message:

    WHILEM: Detected a super-linear match, switching on caching...
    

    You could try looking if Yves "demerphq" Orton has done any talks on this.