Positive lookbehind vs match reset (\K) regex feature
Asked Answered
H

2

9

I just learned about the apparently undocumented \K behavior in Ruby regex (thanks to this answer by anubhava). This feature (possibly named Keep?) also exists in PHP, Perl, and Python regex flavors. It is described elsewhere as "drops what was matched so far from the match to be returned."

"abc".match(/ab\Kc/)     # matches "c"

Is this behavior identical to the positive lookbehind marker as used below?

"abc".match(/(?<=ab)c/)  # matches "c"

If not, what differences do the two exhibit?

Higley answered 29/1, 2016 at 19:31 Comment(1)
One difference is, that the lookbehind does not consume characters. See this example with lookbehind vs this with use of \K.Giffer
B
10

It's easier to see the difference between \K and (?<=...) with the String#scan method.

A lookbehind is a zero-width assertion that doesn't consume characters and that is tested (backwards) from the current position:

> "abcdefg".scan(/(?<=.)./)
=> ["b", "c", "d", "e", "f", "g"]

The "keep" feature \K (that isn't an anchor) defines a position in the pattern where all that was matched so far by the pattern on the left is removed from the match result. But all characters matched before the \K are consumed, they just don't appear in the result:

> "abcdefg".scan(/.\K./)
=> ["b", "d", "f"]

The behaviour is the same as without \K:

> "abcdefg".scan(/../)
=> ["ab", "cd", "ef"]

except that the characters before the \K are removed from the result.

One interesting use of \K is to emulate a variable-length lookbehind, which is not allowed in Ruby (the same for PHP and Perl), or to avoid the creation of a unique capture group. For example (?<=a.*)f. can be implemented using \K:

> "abcdefg".match(/a.*\Kf./)
=> #<MatchData "fg">

An alternative way would be to write /a.*(f.)/, but the \K avoids the need to create a capture group.

Note that the \K feature also exists in the python regex module, even this one allows variable-length lookbehinds.

Bibbs answered 29/1, 2016 at 20:12 Comment(1)
I don't know about Ruby, but \K is faster than (?<=...) in Perl. /// One normally uses \K specifically to avoid having to use a capture group, so it's weird that you would suggest the opposite.Artamas
B
2

After answering this other question and including a metaphor, I was asked to post an answer on this canonical page.

While refining a given regex, I endeavor to optimize the pattern by a simple hierarchy. Accuracy, then Efficiency, then Readabilty, then Brevity. Lower order criteria are typically irrelevant to consider when a higher order criteria is imperfect/suboptimal.


Please allow me to simplify an already simplified task in the asked question. Instead of matching c, let's make one or more zero-width matches with PHP and compare /a\K/ versus /(?<=a)/ which both return the same result.

  1. While comparing the two pattern designs for this simple demonstration, the Accuracy is identical.

  2. Efficiency is where a clear winner and loser is determined.

    $string = 'abcbacbacbabcbabcbabcccbbaaabacbabbcacbabca';
    
    $regex1 = '/a\K/'; // 14 matches, 42 steps https://regex101.com/r/j4Jrrj/1
    
    $regex2 = '/(?<=a)/'; // 14 matches, 167 steps https://regex101.com/r/JuZJDE/1
    

Here's why: With $regex2, every step that the regex engine takes (each character in the string traversed), it has to turn around and check for a possible a. Conversely, $regex1 marches swiftly through the characters (looking nowhere but most immediate character), if it encounters an a, then a match is made and it doesn't need to look around. As a general tip, lookarounds (and capture groups) can lead to losses in efficiency -- its worth running some tests to see how small tweaks may effect your pattern efficiency.


A common scenario is to split a string without consuming any characters. Splitting oneTwoThreeFourFiveSix between a lowercase letter then an uppercase letter, using preg_split() can be done with:

  1. "look both ways" - /(?<=[a-z])(?=[A-Z])/ will take 70 steps on the 22-character string (because it needs to look one or both ways at every position).
  2. "match any lower, release, lookahead" - /[a-z]\K(?=[A-Z])/ will take 78 steps (because of the volume of lowercase letters and then take 2 subsequent actions).
  3. "match consecutive lowers, release, lookahead" - /[a-z]+\K(?=[A-Z])/ will take 41 steps (because it only takes additional actions on the last lower).

Of course: preg_match() could traverse the string with /(?:^|[A-Z])[a-z]*/ in 38 steps, but that may be a different discussion on Accuracy in the context of a real project.


When we stop asking the regex engine to stop at every character, the competition can tighten up.

   $string = 'abcbacbacbabcbabcbabcccbbaaabacbabbcacbabca';

   $regex1 = '/ab\Kc/'; // 5 matches, 45 steps https://regex101.com/r/j4Jrrj/2

   $regex2 = '/(?<=ab)c/'; // 5 matches, 44 steps https://regex101.com/r/JuZJDE/2

$regex1 traverses the string until it finds an a, then a b, then a c.
$regex2 traverses the string until it finds a c, then looks behind for a b, then an a.

The sample string contains a 14x, c 24x, ab 7x, and bc 6x. The character existence and order has a direct impact on step count. For the record, /abc/ requires only 38 steps, so the \K costs 7 steps (for each ab matched), but /ab(c)/ costs more -- 50 steps.


And to show a lookbehind outperform \K, search for semicolons which occur after a digit in Price: $5666;Weight: 7145;Height: 420;Width: 411;. Because digits occur far more frequently than semicolons, /(?<=\d);/ takes 20 steps and /\d\K;/ takes 46 steps.

The takeaway

Asking the regex engine to take an action on every position in a string is an inefficient task -- even if you prefer the readability.

Beyond that, it is often the frequency and order of characters which determine the regex engines performance.

Bal answered 25/5, 2023 at 23:34 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.