Matching text between delimiters: greedy or lazy regular expression?
Asked Answered
B

3

18

For the common problem of matching text between delimiters (e.g. < and >), there's two common patterns:

  • using the greedy * or + quantifier in the form START [^END]* END, e.g. <[^>]*>, or
  • using the lazy *? or +? quantifier in the form START .*? END, e.g. <.*?>.

Is there a particular reason to favour one over the other?

Barbiturism answered 29/8, 2011 at 8:12 Comment(0)
L
12

Some advantages:

[^>]*:

  • More expressive.
  • Captures newlines regardless of /s flag.
  • Considered quicker, because the engine doesn't have to backtracks to find a successful match (with [^>] the engine doesn't make choices - we give it only one way to match the pattern against the string).

.*?

  • No "code duplication" - the end character only appears once.
  • Simpler in cases the end delimiter is more than a character long. (a character class would not work in this case) A common alternative is (?:(?!END).)*. This is even worse if the END delimiter is another pattern.
Libava answered 29/8, 2011 at 8:18 Comment(1)
Note that [^>]* will only not backtrack if it is followed by what is inside the negated class ([^>]*> in this case). Kobi, I know you know and probably meant this, but wanted to make sure others don't think that [^>]* and [^>]*+ (possessive) are the same. Besides that, nice answer!Hauge
S
7

The first is more explicit, i. e. it definitely excludes the closing delimiter from being part of the matched text. This is not guaranteed in the second case (if the regular expression is extended to match more than just this tag).

Example: If you try to match <tag1><tag2>Hello! with <.*?>Hello!, the regex will match

<tag1><tag2>Hello!

whereas <[^>]*>Hello! will match

<tag2>Hello!
Sassenach answered 29/8, 2011 at 8:20 Comment(2)
Nice example that in certain circumstances reluctant matching can match two substrings where many people expect it to match just one.Hauge
+1, great example. It was really hard to choose an answer this time, but I took Kobis, since he lists pros and cons of both options.Barbiturism
O
7

What most people fail to consider when approaching questions like this is what happens when the regex is unable to find a match. That's when the killer performance sinkholes are most likely to appear. For example, take Tim's example, where you're looking for something like <tag>Hello!. Consider what happens with:

<.*?>Hello!

The regex engine finds a < and it quickly finds a closing >, but not >Hello!. So the .*? continues looking for a > that is followed by Hello!. If there isn't one, it will go all the way to the end of the document before it gives up. Then the regex engine resumes scanning until it finds another <, and tries again. We already know how that's going to turn out, but the regex engine, typically, doesn't; it goes through the same rigamarole with every < in the document. Now consider the other regex:

<[^>]*>Hello!

As before, it quickly matches from the < to the >, but fails to match Hello!. It will backtrack to the <, then quit and start scanning for another <. It will still check every < like the first regex did, but it won't search all the way to the end of the document every time it finds one.

But it's even worse than that. If you think about it, .*? is effectively equivalent to a negative lookahead. It's saying "Before consuming the next character, make sure the remainder of the regex can't match at this position." In other words,

/<.*?>Hello!/

...is equivalent to:

/<(?:(?!>Hello!).)*(?:>Hello!|\z(*FAIL))/

So at every position you're performing, not just a normal match attempt, but a much more expensive lookahead. (It's at least twice as costly, because the lookahead has to scan at least one character, then the . goes ahead and consumes a character.)

((*FAIL) is one of Perl's backtracking-control verbs (also supported in PHP). |\z(*FAIL) means "or reach the end of the document and give up".)

Finally, there's another advantage of the negated-character-class approach. While it doesn't (as @Bart pointed out) act like the quantifier is possessive, there's nothing to stop you from making it possessive, if your flavor supports it:

/<[^>]*+>Hello!/

...or wrap it in an atomic group:

/(?><[^>]*>)Hello!/

Not only will these regexes never backtrack unnecessarily, they don't have to save the state information that makes backtracking possible.

Ouse answered 29/8, 2011 at 9:50 Comment(2)
Good answer. A rather important point here, however, is that the comparison of <.*?>Hello! to <[^>]*>Hello! isn't quite fair. Your end delimiter in this case is actually >Hello!, not >, and [^>] cannot handle that at all. I tried to refer to that in the last point of my answer.Libava
Yes, appending Hello! to the original regex effectively changes the closing delimiter from a single character to a multi-character sequence. And that turns the .*? version into a potential black hole, while the [^>]* version still works fine. I'm saying that in isolation, there's practically nothing to choose between the two styles; let the regex get just a little more complex, though, and the choice becomes crucially important.Ouse

© 2022 - 2024 — McMap. All rights reserved.