How does Perl apparently avoid catastrophic backtracking?
Asked Answered
M

2

7

I am using Perl 5.34.1 and Python 3.10.12.

The following script takes 16 seconds in Python:

import re
patt = re.compile(r"W(X|Y+)+Z")
print(patt.search("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA"))

But takes almost no time at all in Perl:

use v5.34;
my $patt = qr/W(X|Y+)+Z/;
print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);

This is an example of catastrophic backtracking provided by https://medium.com/bigpanda-engineering/catastrophic-backtracking-the-dark-side-of-regular-expressions-80cab9c443f6.

I would expect that any backtracking regex engine would suffer in this case, but apparently the Perl regex engine does not. How does it avoid the catastrophic backtracking in this example? Is it actually backtracking "catastrophically" internally, but orders of magnitude faster than the Python engine? Is the some fast-path optimization it uses to prevent fail early before completing the full backtracking sequence?

Methyl answered 27/9, 2023 at 13:48 Comment(3)
I'm not sure of the precise algorithm used in perl, but it's done by memoizing partial results. Once it's determined that (X|Y+)*Z fails for some suffix of the string, it isn't necessary to test that same subexpression against that same suffix again.Ibson
@MattTimmermans if you can elaborate a bit, that might still be worth an answer, even if it's not a complete answer, because I think it would be useful for other people who are interested in this. Or if you know of any existing questions on other sites, cross-linking would be appreciated as well.Methyl
fservant.github.io/papers/…Ibson
A
9

Because of optimizations. It realizes a Z is needed, but it doesn't find one.

$ perl -Mre=debug -e'
    use v5.34;
    my $patt = qr/W(X|Y+)+Z/;
    print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
'
Compiling REx "W(X|Y+)+Z"
Final program:
   1: EXACT <W> (3)
   3: CURLYX<1:1>[0]{1,INFTY} (21)
   7:   OPEN1 (9)
   9:     BRANCH (buf:1/1) (13)
  11:       EXACT <X> (18)
  13:     BRANCH (buf:1/1) (FAIL)
  15:       PLUS (18)
  16:         EXACT <Y> (0)
  18:   CLOSE1 (20)
  20: WHILEM[1/1] (0)
  21: NOTHING (22)
  22: EXACT <Z> (24)
  24: END (0)
anchored "W" at 0..0 floating "Z" at 2..9223372036854775807 (checking floating) minlen 3
Matching REx "W(X|Y+)+Z" against "WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA"
Intuit: trying to determine minimum start position...
  doing 'check' fbm scan, [2..30] gave -1
  Did not find floating substr "Z"...
Match rejected by optimizer
Freeing REx: "W(X|Y+)+Z"

That said, using qr/W(X|Y+)+(Z|!)/ bypasses that optimization.

$ perl -Mre=debug -e'
    use v5.34;
    my $patt = qr/W(X|Y+)+Z/;
    print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
' 2>&1 | wc -l
22

$ perl -Mre=debug -e'
    use v5.34;
    my $patt = qr/W(X|Y+)+(Z|!)/;
    print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
' 2>&1 | wc -l
2971

Still, the difference is less than 1 ms on my machine.

$ time perl -e'
    use v5.34;
    my $patt = qr/W(X|Y+)+Z/;
    print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
' 2>&1 | wc -l
0

real    0m0.002s
user    0m0.002s
sys     0m0.000s

$ time perl -e'
    use v5.34;
    my $patt = qr/W(X|Y+)+(Z|!)/;
    print("WYYYYYYYYYYYYYYYYYYYYYYYYYYYYA" =~ $patt);
' 2>&1 | wc -l
0

real    0m0.002s
user    0m0.002s
sys     0m0.000s

Obviously, there are other optimizations. -Mre=debug with this version includes this message:

WHILEM: Detected a super-linear match, switching on caching...

You could try looking if Yves "demerphq" Orton has done any talks on this.

Arlynearlynne answered 27/9, 2023 at 14:27 Comment(0)
C
3

Amongst other optimisations, perl's regex engine has a thing called the superlinear cache. After a certain number of iterations, the engine decides that the match has gone "superlinear", and allocates a cache which is an array of bytes of a similar length to the string. It uses bits within this cache to mark certain match positions within the string which have failed at a certain point in the pattern in such a way that they would always fail again if a similar position was reached. Thus endless retries of similar cases are skipped. Whether this technique is used depends on the details of the pattern.

Courtly answered 28/9, 2023 at 16:53 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.