After answering this other question and including a metaphor, I was asked to post an answer on this canonical page.
While refining a given regex, I endeavor to optimize the pattern by a simple hierarchy. Accuracy, then Efficiency, then Readabilty, then Brevity. Lower order criteria are typically irrelevant to consider when a higher order criteria is imperfect/suboptimal.
Please allow me to simplify an already simplified task in the asked question. Instead of matching c
, let's make one or more zero-width matches with PHP and compare /a\K/
versus /(?<=a)/
which both return the same result.
While comparing the two pattern designs for this simple demonstration, the Accuracy is identical.
Efficiency is where a clear winner and loser is determined.
$string = 'abcbacbacbabcbabcbabcccbbaaabacbabbcacbabca';
$regex1 = '/a\K/'; // 14 matches, 42 steps https://regex101.com/r/j4Jrrj/1
$regex2 = '/(?<=a)/'; // 14 matches, 167 steps https://regex101.com/r/JuZJDE/1
Here's why: With $regex2
, every step that the regex engine takes (each character in the string traversed), it has to turn around and check for a possible a
. Conversely, $regex1
marches swiftly through the characters (looking nowhere but most immediate character), if it encounters an a
, then a match is made and it doesn't need to look around. As a general tip, lookarounds (and capture groups) can lead to losses in efficiency -- its worth running some tests to see how small tweaks may effect your pattern efficiency.
A common scenario is to split a string without consuming any characters.
Splitting oneTwoThreeFourFiveSix
between a lowercase letter then an uppercase letter, using preg_split()
can be done with:
- "look both ways" -
/(?<=[a-z])(?=[A-Z])/
will take 70 steps on the 22-character string (because it needs to look one or both ways at every position).
- "match any lower, release, lookahead" -
/[a-z]\K(?=[A-Z])/
will take 78 steps (because of the volume of lowercase letters and then take 2 subsequent actions).
- "match consecutive lowers, release, lookahead" -
/[a-z]+\K(?=[A-Z])/
will take 41 steps (because it only takes additional actions on the last lower).
Of course: preg_match()
could traverse the string with /(?:^|[A-Z])[a-z]*/
in 38 steps, but that may be a different discussion on Accuracy in the context of a real project.
When we stop asking the regex engine to stop at every character, the competition can tighten up.
$string = 'abcbacbacbabcbabcbabcccbbaaabacbabbcacbabca';
$regex1 = '/ab\Kc/'; // 5 matches, 45 steps https://regex101.com/r/j4Jrrj/2
$regex2 = '/(?<=ab)c/'; // 5 matches, 44 steps https://regex101.com/r/JuZJDE/2
$regex1
traverses the string until it finds an a
, then a b
, then a c
.
$regex2
traverses the string until it finds a c
, then looks behind for a b
, then an a
.
The sample string contains a
14x, c
24x, ab
7x, and bc
6x.
The character existence and order has a direct impact on step count.
For the record, /abc/
requires only 38 steps, so the \K
costs 7 steps (for each ab
matched), but /ab(c)/
costs more -- 50 steps.
And to show a lookbehind outperform \K
, search for semicolons which occur after a digit in Price: $5666;Weight: 7145;Height: 420;Width: 411;
.
Because digits occur far more frequently than semicolons, /(?<=\d);/
takes 20 steps and /\d\K;/
takes 46 steps.
The takeaway
Asking the regex engine to take an action on every position in a string is an inefficient task -- even if you prefer the readability.
Beyond that, it is often the frequency and order of characters which determine the regex engines performance.