Why does my non-greedy Perl regex still match too much?

Asked 21/10, 2009 at 5:52 Answered 25/10, 2009 at 21:27

Say, I have a line that contains the following string:

"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.

and I want to extract

"$dick" said "blah blah blah"

I have the following code:

my ($term) = /(".+?" said ".+?")/g;
print $term;

But it gives me more than I need:

"$tom" said blah blah blash.  "$dick" said "blah blah blah"

I tried grouping my pattern as a whole by using the non-capturing parens:

my ($term) = /((?:".+?" said ".+?"))/g;

But the problem persists.

I've reread the Nongreedy Quantifiers section of Learning Perl but it's got me nowhere so far.

Thanks for any guidance you can generously offer :)

Rogerrogerio answered 21/10, 2009 at 5:52 Comment(3)

The first "blah blah blash" is not in quotes, so your regexp is grabbing the second set. – Foxy 25/10, 2009 at 22:49

@Ether, my problem is: I thought Perl can treat my pattern as a whole. But I was wrong. Thing are becoming clearer to me that Perl always tries to match the first sub pattern and then the next subpattern and on and on. Seems there's no such thing as "simultaneous whole pattern match". – Rogerrogerio 26/10, 2009 at 2:40

@brian, thanks. I like the way you rephrased my question :) – Rogerrogerio 26/10, 2009 at 2:43

The problem is that, even though it's not greedy, it still keeps trying. The regex doesn't see

"$tom" said blah blah blash.

and think "Oh, the stuff following the "said" isn't quoted, so I'll skip that one." It thinks "well, the stuff after "said" isn't quoted, so it must still be part of our quote." So ".+?" matches

"$tom" said blah blah blash.  "$dick"

What you want is "[^"]+". This will match two quote marks enclosing anything that's not a quote mark. So the final solution:

("[^"]+" said "[^"]+")

Haar answered 21/10, 2009 at 5:57 Comment(10)

Thanks Chris, for the solution and the explanation :) Thanks a lot for the englightenment! – Rogerrogerio 21/10, 2009 at 6:5

++! Another good reason to use [^"]+ is that it will cut down on unnecessary back-tracking and make your regex more efficient. – Motmot 21/10, 2009 at 6:18

@Motmot - I discussed this with Alex Martelli in his answer. This is often cited, but even for 10,000,000 comparisons the performance difference is still almost unnoticeable, both in Perl and in Python. It seems the mantra changes from "the compiler will optimize that" to "the regex engine will optimize that." :P (See my comment to his answer for the results of my timings.) – Haar 21/10, 2009 at 6:31

@Chris Lutz: I think daotoad was comparing "[^"]+" to ".*+", not to "[^"]+?". – Thier 21/10, 2009 at 9:6

@Chris,@daotaod,@ysth, I have another question. I changed the string to ""$tom" said blah blah blash. "$dick" said "blah blah blah". "$harry" said blah blah blah. "$jane" said "blah blah blah"." Then I tried "print $term" but I only get "$dick" said "blah blah blah"? Why ""$jane" said "blah blah blah" is not captured? Thanks! – Rogerrogerio 21/10, 2009 at 9:38

I already used the g modifier: "my ($term) = /("[^"]+" said "[^"]+")/g; print $term \n;" but things are not right. What am I doing wrong? – Rogerrogerio 21/10, 2009 at 9:41

Okiedokie, the following code worked fine but I still want to know why "print $term" only prints the first capture : "my @group = ($_=~ m/("[^"]+" said "[^"]+")/g); print "@group \n"; – Rogerrogerio 21/10, 2009 at 10:15

Well, I tried the following code and it worked. But it's quite inconvenient : my ($first,$second) = /("[^"]+" said "[^"]+")/g; print $first; print $second; – Rogerrogerio 21/10, 2009 at 10:23

Because my ($term) = something assigns the return of something (as a list) to the list ($term) and since that list only has one item to assign to, Perl silently discards the excess. The correct way to capture an arbitrary number of matches with the /g modifier is to use a list like @group so that it always has enough room. You can also use the /g modifier in a while loop to iterate through matches, and then perform an action for each. Or you can use map for the same goal: map { do stuff } (/regex/g); – Haar 21/10, 2009 at 21:31

@Chris, thanks for the clarification. I already figured out how to use @group to capture more than one match but just didn't understand why my ($term) construct failed. Now I see the point. It's different from $_=~ construct. Thanks :) – Rogerrogerio 22/10, 2009 at 0:54

Unfortunately " is a peculiar-enough character to need to be treated carefully. Use:

my ($term) = /("[^"]+?" said "[^"]+?")/g;

and it should work fine (it does for me...!). I.e. explicitly match sequences of "nondoublequotes" rather than sequences of arbitrary characters.

Lunik answered 21/10, 2009 at 6:0 Comment(3)

Woah! The Lord of Python answers a Perl question! Pinch me, I'm dreaming. (As a note, the non-greedy match isn't necessary anymore, and will probably just slow the regex engine down. But this does work, as you say.) – Haar 21/10, 2009 at 6:2

I was an excellent Perl'er back in Perl 4 time (and this question's quite germane to Perl 4;-) -- it's the weird novelties of Perl 5 that threw me off and eventually made a Pythonista out of me!-) (Back in "Perl 4" times, non-greedy was never less efficient than greedy -- haven't looked at any current Perl RE engines to see if they've spoiled THOSE, too, though!-). – Lunik 21/10, 2009 at 6:7

There's no extreme difference, but after testing I've found that the greedy "[^"]+" is slightly faster than "[^"]+?". It's not a significant difference: 7.13 seconds for the greedy version, 7.68 for the non-greedy (for 20,000,000 tests). It's doubtful that this insignificant of a difference will ever affect real-life code, but it is good to note. (For completeness, in Python the greedy one takes 15.15 seconds and the non-greedy one takes 16.47 seconds. Same data set.) – Haar 21/10, 2009 at 6:29

Others have mentioned how to fix this.

I'll answer how you can debug this: you can see what's happening by using more captures:

 bash$ cat story | perl -nle 'my ($term1, $term2, $term3) = /(".+?") (said) (".+?")/g ; 
      print "term1 = \"$term1\" term2 = \"$term2\" term3 = \"$term3\" \n"; '
 term1 = ""$tom" said blah blah blash.  "$dick"" term2 = "said" term3 = ""blah blah blah""

Unionize answered 21/10, 2009 at 6:1 Comment(0)

Your problem here is that there are two possible matches for your regexp, the one you want (a shorter one) and the one the regex engine chooses. The engine chooses that specific match because it prefers a match that starts earlier in the string and is longer to a match that starts later and is shorter. In other words, early matches win over shorter ones.

To solve this you need to make your regex more specific (as in telling the engine that $term should not contain any quotes. It's a good idea to make your regexes as specific as possible anyway.

For more details and gotchas regarding regular expressions, I recommend Jeffrey Friedl's excellent book: Mastering Regular Expressions

Parsec answered 25/10, 2009 at 21:27 Comment(1)

@kixx, thanks for the explanation and the book recommendation. – Rogerrogerio 26/10, 2009 at 0:44

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags