I am currently looking into the problem of regular expressions which can end up running in exponential time when matched against a certain input, for example both (a*)*
and (a|a)*
potentially exhibit 'catastrophic backtracking' when matched against the string aaaaab - for every extra 'a' in the matched string, the time needed to attempt to match the string doubles. This is only the case if the engine uses a backtracking/NFA approach of attempting to try all possible branches in the tree before failing, such as that used in PCRE.
My question is, why isn't (a?)*
vulnerable? Based on my understanding of backtracking, what should happen in the string "aaaab" is essentially what happens with (a|a)*
. If we construct the NFA using the standard Thomspson NFA construction, surely for each epsilon transition that occurs, the engine will have to keep taking them and backtracking in the same way it would for the case of two a's? For example (omitting some steps and where @ replaces epsilon):
"aaaa" matches, but can't match 'b', fail (backtrack)
"aaaa@" matches, 'b' fail (backtrack)
"aaa@a" matches, 'b' fail (backtrack)
"aaa@a@" matches, 'b' fail (backtrack)
...
"@a@a@a@a@" matches, 'b' fails (backtrack)
trying all possible combinations of epsilons and a, surely leading to an exponential blowup of routes?
It would make sense to remove the epsilon transitions from the NFA, but I believe this has the effect of removing all non-determinism from the (a*)*
pattern. This is definitely vulnerable though, so I'm not entirely sure what's going on!
Thank you very much in advance!
Edit: It has been pointed out by Qtax that epsilons can't still be present when the NFA is traversed with the traditional backtracking, otherwise (@)*
will attempt to match forever. So what NFA implementation could possibly lead to (a*)*
and (a|a)*
being exponential, and (a?)*
not being so? This is the crux of the question really.
(a*)*
(i.e. unoptimised engines),(a?)*
appears to be fine - why?! – Ravennaa
then empty string for every repetition) then the regex would never fail, as you can repeat()*
indefinably. – Schuss()*
. I'm finding it difficult to understand how existing implementations optimise to stop this from happening though. A very naive implementation of simply constructing the NFA through Thompson's construction (e-moves and all), and then attempting to backtrack through it would never terminate for the()*
reason you mention, so what do they do that causes(a?)*
to succeed, but(a*)*
to fail? It can't be as simple as removing the e-moves altogether as I believe that has the side effect of causing(a*)*
to be deterministic – Ravenna^(a*)*$
and^(a?)*$
. Example command (win):perl -Mre=debug -e "'aaab' =~ /^(a?)*$/"
– Schuss