Why does my non-greedy Perl regex still match too much?
Asked Answered
R

4

11

Say, I have a line that contains the following string:

"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.

and I want to extract

"$dick" said "blah blah blah"

I have the following code:

my ($term) = /(".+?" said ".+?")/g;
print $term;

But it gives me more than I need:

"$tom" said blah blah blash.  "$dick" said "blah blah blah"

I tried grouping my pattern as a whole by using the non-capturing parens:

my ($term) = /((?:".+?" said ".+?"))/g;

But the problem persists.

I've reread the Nongreedy Quantifiers section of Learning Perl but it's got me nowhere so far.

Thanks for any guidance you can generously offer :)

Rogerrogerio answered 21/10, 2009 at 5:52 Comment(3)
The first "blah blah blash" is not in quotes, so your regexp is grabbing the second set.Foxy
@Ether, my problem is: I thought Perl can treat my pattern as a whole. But I was wrong. Thing are becoming clearer to me that Perl always tries to match the first sub pattern and then the next subpattern and on and on. Seems there's no such thing as "simultaneous whole pattern match".Rogerrogerio
@brian, thanks. I like the way you rephrased my question :)Rogerrogerio
H
19

The problem is that, even though it's not greedy, it still keeps trying. The regex doesn't see

"$tom" said blah blah blash.

and think "Oh, the stuff following the "said" isn't quoted, so I'll skip that one." It thinks "well, the stuff after "said" isn't quoted, so it must still be part of our quote." So ".+?" matches

"$tom" said blah blah blash.  "$dick"

What you want is "[^"]+". This will match two quote marks enclosing anything that's not a quote mark. So the final solution:

("[^"]+" said "[^"]+")
Haar answered 21/10, 2009 at 5:57 Comment(10)
Thanks Chris, for the solution and the explanation :) Thanks a lot for the englightenment!Rogerrogerio
++! Another good reason to use [^"]+ is that it will cut down on unnecessary back-tracking and make your regex more efficient.Motmot
@Motmot - I discussed this with Alex Martelli in his answer. This is often cited, but even for 10,000,000 comparisons the performance difference is still almost unnoticeable, both in Perl and in Python. It seems the mantra changes from "the compiler will optimize that" to "the regex engine will optimize that." :P (See my comment to his answer for the results of my timings.)Haar
@Chris Lutz: I think daotoad was comparing "[^"]+" to ".*+", not to "[^"]+?".Thier
@Chris,@daotaod,@ysth, I have another question. I changed the string to ""$tom" said blah blah blash. "$dick" said "blah blah blah". "$harry" said blah blah blah. "$jane" said "blah blah blah"." Then I tried "print $term" but I only get "$dick" said "blah blah blah"? Why ""$jane" said "blah blah blah" is not captured? Thanks!Rogerrogerio
I already used the g modifier: "my ($term) = /("[^"]+" said "[^"]+")/g; print $term \n;" but things are not right. What am I doing wrong?Rogerrogerio
Okiedokie, the following code worked fine but I still want to know why "print $term" only prints the first capture : "my @group = ($_=~ m/("[^"]+" said "[^"]+")/g); print "@group \n";Rogerrogerio
Well, I tried the following code and it worked. But it's quite inconvenient : my ($first,$second) = /("[^"]+" said "[^"]+")/g; print $first; print $second;Rogerrogerio
Because my ($term) = something assigns the return of something (as a list) to the list ($term) and since that list only has one item to assign to, Perl silently discards the excess. The correct way to capture an arbitrary number of matches with the /g modifier is to use a list like @group so that it always has enough room. You can also use the /g modifier in a while loop to iterate through matches, and then perform an action for each. Or you can use map for the same goal: map { do stuff } (/regex/g);Haar
@Chris, thanks for the clarification. I already figured out how to use @group to capture more than one match but just didn't understand why my ($term) construct failed. Now I see the point. It's different from $_=~ construct. Thanks :)Rogerrogerio
L
3

Unfortunately " is a peculiar-enough character to need to be treated carefully. Use:

my ($term) = /("[^"]+?" said "[^"]+?")/g;

and it should work fine (it does for me...!). I.e. explicitly match sequences of "nondoublequotes" rather than sequences of arbitrary characters.

Lunik answered 21/10, 2009 at 6:0 Comment(3)
Woah! The Lord of Python answers a Perl question! Pinch me, I'm dreaming. (As a note, the non-greedy match isn't necessary anymore, and will probably just slow the regex engine down. But this does work, as you say.)Haar
I was an excellent Perl'er back in Perl 4 time (and this question's quite germane to Perl 4;-) -- it's the weird novelties of Perl 5 that threw me off and eventually made a Pythonista out of me!-) (Back in "Perl 4" times, non-greedy was never less efficient than greedy -- haven't looked at any current Perl RE engines to see if they've spoiled THOSE, too, though!-).Lunik
There's no extreme difference, but after testing I've found that the greedy "[^"]+" is slightly faster than "[^"]+?". It's not a significant difference: 7.13 seconds for the greedy version, 7.68 for the non-greedy (for 20,000,000 tests). It's doubtful that this insignificant of a difference will ever affect real-life code, but it is good to note. (For completeness, in Python the greedy one takes 15.15 seconds and the non-greedy one takes 16.47 seconds. Same data set.)Haar
U
3

Others have mentioned how to fix this.

I'll answer how you can debug this: you can see what's happening by using more captures:

 bash$ cat story | perl -nle 'my ($term1, $term2, $term3) = /(".+?") (said) (".+?")/g ; 
      print "term1 = \"$term1\" term2 = \"$term2\" term3 = \"$term3\" \n"; '
 term1 = ""$tom" said blah blah blash.  "$dick"" term2 = "said" term3 = ""blah blah blah""
Unionize answered 21/10, 2009 at 6:1 Comment(0)
P
2

Your problem here is that there are two possible matches for your regexp, the one you want (a shorter one) and the one the regex engine chooses. The engine chooses that specific match because it prefers a match that starts earlier in the string and is longer to a match that starts later and is shorter. In other words, early matches win over shorter ones.

To solve this you need to make your regex more specific (as in telling the engine that $term should not contain any quotes. It's a good idea to make your regexes as specific as possible anyway.

For more details and gotchas regarding regular expressions, I recommend Jeffrey Friedl's excellent book: Mastering Regular Expressions

Parsec answered 25/10, 2009 at 21:27 Comment(1)
@kixx, thanks for the explanation and the book recommendation.Rogerrogerio

© 2022 - 2024 — McMap. All rights reserved.