It seems that using a character class is faster than the alternation in an example like:
[abc]
vs (a|b|c)
I have heard about it being recommended and with a simple test using Time::HiRes
I verified it (~10 times slower).
Also using (?:a|b|c)
in case the capturing parenthesis makes a difference does not change the result.
But I can not understand why. I think it is because of backtracking but the way I see it at each position there are 3 character comparison so I am not sure how backtracking hits in affecting the alternation. Is it a result of the implementation's nature of alternation?
This is because the "OR" construct |
backtracks between the alternation: If the first alternation is not matched, the engine has to return before the pointer location moved during the match of the alternation, to continue matching the next alternation; Whereas the character class can advance sequentially. See this match on a regex engine with optimizations disabled:
Pattern: (r|f)at
Match string: carat
Pattern: [rf]at
Match string: carat
But to be short, the fact that pcre engine optimizes this (single literal characters -> character class) away is already a decent hint that alternations are inefficient.
Because a character class like [abc]
is irreducable and can be optimised, whereas an alternation like (?:a|b|c)
may also be (?:aa(?!xx)|[^xba]*?|t(?=.[^t])t)
.
The authors have chosen not to optimise the regex compiler to check that all elements of an alternation are a single character.
There is a big difference between "check that the next character is in this character class" and "check that the rest of the string matches any one of these regular expressions".
[abc] is irreducable
Don't you mean reducable? I.e. can be reduced to something simpler? –
Kiona perl -Mre=debug -E'qr/a|b|c/'
–
Fourway [a-zA-Z]
or [^"]
. –
Marcasite © 2022 - 2024 — McMap. All rights reserved.
pcretest
. Here's the results. Quite funny to see that PCRE somehow improves[abc]
to[a-c]
and that there seems to have a lot more "steps" when using a capturing group. – Intumescencereturn x == 'A' || x == 'b' || x =='c'
, in the second case the engine needs to restart the evaluation at a much higher level. – Rectify