Regex (a?)* not exponential?
Asked Answered
R

1

12

I am currently looking into the problem of regular expressions which can end up running in exponential time when matched against a certain input, for example both (a*)* and (a|a)* potentially exhibit 'catastrophic backtracking' when matched against the string aaaaab - for every extra 'a' in the matched string, the time needed to attempt to match the string doubles. This is only the case if the engine uses a backtracking/NFA approach of attempting to try all possible branches in the tree before failing, such as that used in PCRE.

My question is, why isn't (a?)* vulnerable? Based on my understanding of backtracking, what should happen in the string "aaaab" is essentially what happens with (a|a)*. If we construct the NFA using the standard Thomspson NFA construction, surely for each epsilon transition that occurs, the engine will have to keep taking them and backtracking in the same way it would for the case of two a's? For example (omitting some steps and where @ replaces epsilon):

"aaaa" matches, but can't match 'b', fail (backtrack)
"aaaa@" matches, 'b' fail (backtrack)
"aaa@a" matches, 'b' fail (backtrack)
"aaa@a@" matches, 'b' fail (backtrack)
...
"@a@a@a@a@" matches, 'b' fails (backtrack)

trying all possible combinations of epsilons and a, surely leading to an exponential blowup of routes?

It would make sense to remove the epsilon transitions from the NFA, but I believe this has the effect of removing all non-determinism from the (a*)* pattern. This is definitely vulnerable though, so I'm not entirely sure what's going on!

Thank you very much in advance!

Edit: It has been pointed out by Qtax that epsilons can't still be present when the NFA is traversed with the traditional backtracking, otherwise (@)* will attempt to match forever. So what NFA implementation could possibly lead to (a*)* and (a|a)* being exponential, and (a?)* not being so? This is the crux of the question really.

Ravenna answered 26/8, 2012 at 23:53 Comment(5)
PCRE and Perl regex engines are optimized in many ways. All of your examples will fail quickly without catastrophic backtracking, even on huge strings.Schuss
I am aware of this, maybe a better example would be Java's regex engine then, which does not fail quickly. Early versions of PCRE and Perl suffered the same problems as well, and my question can be applied to them! However, on those engines that appear to be vulnerable to patterns such as (a*)* (i.e. unoptimised engines), (a?)* appears to be fine - why?!Ravenna
Optimizations. Quantified zero width expression (without side effects) is useless, and if it were used as you describe (try to match a then empty string for every repetition) then the regex would never fail, as you can repeat ()* indefinably.Schuss
A good point about ()*. I'm finding it difficult to understand how existing implementations optimise to stop this from happening though. A very naive implementation of simply constructing the NFA through Thompson's construction (e-moves and all), and then attempting to backtrack through it would never terminate for the ()* reason you mention, so what do they do that causes (a?)* to succeed, but (a*)* to fail? It can't be as simple as removing the e-moves altogether as I believe that has the side effect of causing (a*)* to be deterministicRavenna
If you want an example of how a regex engine handles these cases you can look at Perl's regex debug output, which shows you to what the expressions compile and every step of matching: ^(a*)*$ and ^(a?)*$. Example command (win): perl -Mre=debug -e "'aaab' =~ /^(a?)*$/"Schuss
R
4

Ok, after some sleuthing, I've eventually managed to find out that this is down to the use of 'barriers' in NFA implementations. Simply put, barriers are placed at strategic points in the NFA (such as on the node immediately after the 'a' transition in the NFA construction of a*). They require that the match has progressed on from the previous time that barrier was hit. This prevents the NFA from ever getting into a situation where it matches an infinite number of epsilons and allows it to terminate.

In other words, it is not possible to go from one barrier to the same barrier only matching e-moves - if this happened, the route is dropped and backtracking occurs from the previous point. This also has the side effect that (a?)* is not vulnerable to exponential blow-up, since the a? cannot match null the second time around.

Ravenna answered 28/8, 2012 at 16:58 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.