Regex: Is Lazy Worse?
Asked Answered
T

6

12

I have always written regexes like this

<A HREF="([^"]*)" TARGET="_blank">([^<]*)</A>

but I just learned about this lazy thing and that I can write it like this

<A HREF="(.*?)" TARGET="_blank">(.*?)</A>

is there any disadvantage to using this second approach? The regex is definitely more compact (even SO parses it better).

Edit: There are two best answers here, which point out two important differences between the expressions. ysth's answer points to a weakness in the non-greedy/lazy one, in which the hyperlink itself could possibly include other attributes of the A tag (definitely not good). Rob Kennedy points out a weakness in the greedy example, in that anchor texts cannot include other tags (definitely not okay, because it wouldn't grab all the anchor text either)... so the answer is that, regular expressions being what they are, lazy and non-lazy solutions that seem the same are probably not semantically equivalent.

Edit: Third best answer is by Alan M about relative speed of the expressions. For the time being, I'll mark his as best answer so people give him more points :)

Troytroyer answered 14/12, 2008 at 18:36 Comment(4)
Sure thing, but apparently once the question ages a bit no one loves it anymore.Troytroyer
If you can change the accepted answer, feel free to do so. My answer didn't really answer the question, it just elaborated on the other answers.Costumer
I don't agree. There are three aspects: matching stuff you don't want, not matching stuff you do want, and how much work it's going to be for the processor. No one hit more than one issue.Troytroyer
Oh, I should mention that those are three more aspects than I knew about before posting the question, so it's been a GREAT help, so thanks to all three of you!Troytroyer
C
12

Another thing to consider is how long the target text is, and how much of it is going to be matched by the quantified subexpression. For example, if you were trying to match the whole <BODY> element in a large HTML document, you might be tempted to use this regex:

/<BODY>.*?<\/BODY>/is

But that's going to do a whole lot of unnecessary work, matching one character at a time while effectively doing a negative lookahead before each one. You know the </BODY> tag is going to be very near the end of the document, so the smart thing to do is to use a normal greedy quantitier; let it slurp up the whole rest of the document and then backtrack the few characters necessary to match the end tag.

In most cases you won't notice any speed difference between greedy and reluctant quantifiers, but it's something to keep in mind. The main reason why you should be judicious in your use of reluctant quantifiers is the one that was pointed out by the others: they may do it reluctantly, but they will match more than you want them to if that's what it takes to achieve an overall match.

Costumer answered 15/12, 2008 at 1:57 Comment(0)
A
8

The complemented character class more rigorously defines what you want to match, so whenever you can, I'd use it.

The non greedy regex will match things you probably don't want, such as:

<A HREF="foo" NAME="foo" TARGET="_blank">foo</A>

where your first .*? matches

foo" NAME="foo
Avictor answered 14/12, 2008 at 19:51 Comment(8)
I don't get your last remark. In your opinion, what would be matched here and why would it be different from what we want?Hypothesize
Doesn't the first .*? match as few characters as possible before matching the double quote, thus only matching foo?Lao
ysth: I now see your point, i.e. that the arguments are reordered.Hypothesize
It's hard to say what I would want to match in that case because it's not legal HTML (or at least makes no sense to me).Troytroyer
Kenny: no, .*? will first try matching through the first double quote, but if that doesn't permit a successful match, it will go on to the second double quote, etc.Avictor
@Daniel Rosenstark, I'm changing the example to be legal.Avictor
Ah I had the same trouble as Konrad, thanks for clearing that up ysth.Lao
Wow! I had not thought about that. You're totally right... if it's not precisely the regex that I'm looking for, it gets in the match. For my current case, I'm okay because the code looks JUST like I need, but in general I now see the difference. Problem is I can't mark two best answers...Troytroyer
S
7

Note that your examples are not equivalent. Your first regular expression will not select any links that contain other tags, such as img or b. The second regular expression will, and I expect that's probably what you wanted anyway.

Besides the difference in meaning, the only disadvantage I can think of is that support for non-greedy modifiers isn't quite as prevalent as character-class negation is. It's more widely supported than I thought, before I checked, but notably absent from the list is GNU Grep. If the regular-expression evaluators you're using support it, then go ahead and use it.

Skyline answered 14/12, 2008 at 19:16 Comment(1)
Hi Rob, it's true, I do want ANYTHING that can go between the A tags. Whether my regex evaluator supports it... wow, I didn't even know that it couldn't. I'll have to check (I'm in AS3) and I'll edit the question with that.Troytroyer
K
3

It's not about better or worse. The term I've seen the most is greedy vs. non-greedy, but however you put they do two different things. You want to use the right one for the task. I.e. turn off the greedy option when you don't want to capture multiple matches in a line.

Kowal answered 14/12, 2008 at 18:40 Comment(0)
H
1

“lazy” is the wrong word here. You mean non-greedy as opposed to greedy. There's no disadvantage in using it, that I know of. But in your special case, neither should it be more efficient.

Hypothesize answered 14/12, 2008 at 18:40 Comment(4)
Thanks for your answer. These guys regular-expressions.info/repeat.html refer to lazy or greedy, which I admit makes less sense than greedy and non-greedy.Troytroyer
It might interest you to know that "those guys" is actually Jan Goyvaerts, an SO member. ;)Costumer
Yeah, I really can't complain about the quality of SO memeber's. The last time I used a technical forum with this level of responses was the xSLT forum, and a famous guru named David Carlile (something like that) answered most of the questions personally.Troytroyer
Java calls them "reluctant". The quantifiers are greedy, possessive or reluctant.Dnepropetrovsk
Z
1

Non-greedy is better, is it not? It works forward, checking for a match each time and stopping when it finds one, whereas the normal kleene closure (*) works backwards matching the rest of the input and removing things until it finds a match.

In the end, they do different things, but I think non-greedy outperforms greedy. Bear in mind that I haven't tested this, but now I'm curious.

Zarathustra answered 14/12, 2008 at 18:42 Comment(1)
bet it's implementation dependent. Thanks for your answer!Troytroyer

© 2022 - 2024 — McMap. All rights reserved.