Regular expression: who's greedier?
Asked Answered
E

5

16

My primary concern is with the Java flavor, but I'd also appreciate information regarding others.

Let's say you have a subpattern like this:

(.*)(.*)

Not very useful as is, but let's say these two capture groups (say, \1 and \2) are part of a bigger pattern that matches with backreferences to these groups, etc.

So both are greedy, in that they try to capture as much as possible, only taking less when they have to.

My question is: who's greedier? Does \1 get first priority, giving \2 its share only if it has to?

What about:

(.*)(.*)(.*)

Let's assume that \1 does get first priority. Let's say it got too greedy, and then spit out a character. Who gets it first? Is it always \2 or can it be \3?

Let's assume it's \2 that gets \1's rejection. If this still doesn't work, who spits out now? Does \2 spit to \3, or does \1 spit out another to \2 first?


Bonus question

What happens if you write something like this:

(.*)(.*?)(.*)

Now \2 is reluctant. Does that mean \1 spits out to \3, and \2 only reluctantly accepts \3's rejection?


Example

Maybe it was a mistake for me not to give concrete examples to show how I'm using these patterns, but here's some:

System.out.println(
    "OhMyGod=MyMyMyOhGodOhGodOhGod"
    .replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><My><God>"

// same pattern, different input string
System.out.println(
    "OhMyGod=OhMyGodOhOhOh"
    .replaceAll("^(.*)(.*)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><MyGod><>"

// now \2 is reluctant
System.out.println(
    "OhMyGod=OhMyGodOhOhOh"
    .replaceAll("^(.*)(.*?)(.*)=(\\1|\\2|\\3)+$", "<$1><$2><$3>")
); // prints "<Oh><><MyGod>"
Ellisellison answered 2/4, 2010 at 9:28 Comment(0)
H
9

The addition of your concrete examples changes the nature of the question drastically. It still starts out as I described in my first answer, with the first (.*) gobbling up all the characters, and the second and third groups letting it have them, but then it has to match an equals sign.

Obviously there isn't one at the end of the string, so group #1 gives back characters one by one until the = in the regex can match the = in the target. Then the regex engine starts trying to match (\1|\2|\3)+$ and the real fun starts.

Group 1 gives up the d and group 2 (which is still empty) takes it, but the rest of the regex still can't match. Group 1 gives up the o and group 2 matches od, but the rest of the regex still can't match. And so it goes, with the third group getting involved, and the three of them slicing up the input in every way possible until an overall match is achieved. RegexBuddy reports that it takes 13,426 steps to get there.

In the first example, greediness (or lack of it) isn't really a factor; the only way a match can be achieved is if the words Oh, My and God are captured in separate groups, so eventually that's what happens. It doesn't even matter which group captures which word--that's just first come, first served, as I said before.

In the second and third examples it's only necessary to break the prefix into two chunks: Oh and MyGod. Group 2 captures MyGod in the second example because it's next in line and it's greedy, just like in the first example. In the third example, every time group 1 drops a character, group 2 (being reluctant) lets group 3 take it instead, so that's the one that ends up in possession of MyGod.

It's more complicated (and tedious) than that, of course, but I hope this answers your question. And I have to say, that's an interesting target string you chose; if it were possible for a regex engine to have an orgasm, I think these regexes would be the ones to bring it off. :D

Hamrnand answered 3/4, 2010 at 0:2 Comment(3)
"It doesn't even matter which group captures which word" -- this is at the core of my question actually. When there are multiple solutions, which is actually picked? Is there a regex specification that says what the exact behavior should be, how things are clearly prioritized, etc? Also, you said RegexBuddy and n-steps etc, that sounds almost like a step debugger, which would be awesome. Will investigate. Thanks.Ellisellison
Oh, and yeah, I am basically thinking of these groups as 3 chicks that are competing to eat, binging and purging, etc, until hopefully they reach an orgasmic state of harmonious bliss. It makes working with regex even more fun than it is.Ellisellison
There's no standard or spec for regexes--at least not for NFA or regex-directed engines like those in Java, Perl, Python, .NET, etc.--but I'd be surprised if any of them yielded different results given the same inputs.Hamrnand
H
15

\1 will have priority, \2 and \3 will always match nothing. \2 will then have priority over \3.

As a general rule think of it like this, back-tracking will only occur to satisfy a match, it will not occur to satisfy greediness, so left is best :)

explaining back tracking and greediness is to much for me to tackle here, i'd suggest friedl's Mastering Regular Expressions

Hippel answered 2/4, 2010 at 9:31 Comment(0)
H
9

The addition of your concrete examples changes the nature of the question drastically. It still starts out as I described in my first answer, with the first (.*) gobbling up all the characters, and the second and third groups letting it have them, but then it has to match an equals sign.

Obviously there isn't one at the end of the string, so group #1 gives back characters one by one until the = in the regex can match the = in the target. Then the regex engine starts trying to match (\1|\2|\3)+$ and the real fun starts.

Group 1 gives up the d and group 2 (which is still empty) takes it, but the rest of the regex still can't match. Group 1 gives up the o and group 2 matches od, but the rest of the regex still can't match. And so it goes, with the third group getting involved, and the three of them slicing up the input in every way possible until an overall match is achieved. RegexBuddy reports that it takes 13,426 steps to get there.

In the first example, greediness (or lack of it) isn't really a factor; the only way a match can be achieved is if the words Oh, My and God are captured in separate groups, so eventually that's what happens. It doesn't even matter which group captures which word--that's just first come, first served, as I said before.

In the second and third examples it's only necessary to break the prefix into two chunks: Oh and MyGod. Group 2 captures MyGod in the second example because it's next in line and it's greedy, just like in the first example. In the third example, every time group 1 drops a character, group 2 (being reluctant) lets group 3 take it instead, so that's the one that ends up in possession of MyGod.

It's more complicated (and tedious) than that, of course, but I hope this answers your question. And I have to say, that's an interesting target string you chose; if it were possible for a regex engine to have an orgasm, I think these regexes would be the ones to bring it off. :D

Hamrnand answered 3/4, 2010 at 0:2 Comment(3)
"It doesn't even matter which group captures which word" -- this is at the core of my question actually. When there are multiple solutions, which is actually picked? Is there a regex specification that says what the exact behavior should be, how things are clearly prioritized, etc? Also, you said RegexBuddy and n-steps etc, that sounds almost like a step debugger, which would be awesome. Will investigate. Thanks.Ellisellison
Oh, and yeah, I am basically thinking of these groups as 3 chicks that are competing to eat, binging and purging, etc, until hopefully they reach an orgasmic state of harmonious bliss. It makes working with regex even more fun than it is.Ellisellison
There's no standard or spec for regexes--at least not for NFA or regex-directed engines like those in Java, Perl, Python, .NET, etc.--but I'd be surprised if any of them yielded different results given the same inputs.Hamrnand
H
2

Quantifiers aren't really greedy by default, they're just hasty. In your example, the first (.*) will start out by gobbling up everything it can without regard to the needs of the regex as a whole. Only then does it hand control to the next part, and if necessary it will give back some or all of what it just took (i.e., backtrack) so the rest of the regex can do its work.

That isn't necessary in this case because everything else can legally match zero characters. If the quantifiers were really greedy, the three groups would haggle until they had divided the input as evenly as possible; instead, the second and third groups let the first one keep what it took. They'll take it if it's put in front of them, but they won't fight for it. (That would be true even if they had possessive quantifiers, i.e, (.*)(.*+)(.*+).)

Making the second dot-star reluctant doesn't change anything, but switching the first one does. A reluctant quantifier starts out by matching only as much as it has to, then hands off to the next part. So the first group in (.*?)(.*)(.*) starts out by matching nothing, then the second group gobbles everything, and the third group cries "weee weee weee" all the way home.

Here's a bonus question for you: What happens if you make all three of the quantifiers reluctant? (Hint: In Java this is as much an API question as it is a regex question.)

Hamrnand answered 2/4, 2010 at 10:58 Comment(2)
Re: Q for me: it looks to me like as long as the regex fails to match, \3 has to start eating, and if it's full, then \3 vomits it all up, \2 takes a bite, and \3 starts eating again, etc.Ellisellison
That's true if you use matches(), because the regex is implicitly anchored at both ends. That's not the case with the find() method, so it matches nothing.Hamrnand
F
0

Regular Expressions work in a sequence, that means the Regex-evaluator will only leave a group when he can't find a solution to that group anymore, and eventually do some backtracking to make the string fit to the next group. If you execute this regex, you will get all your chars evaluated in the first group, none in the next ones (Question-sign doesn't matter either).

Florin answered 2/4, 2010 at 11:2 Comment(2)
Re: "you will get all your chars evaluated in the first group, none in the next ones" and "(Question-sign doesn't matter either)." -- I just added examples to show that neither statement is true.Ellisellison
You didn't worked properly with the groups, I tested the following Regex: String s = "Oh(MyGod"; System.out.println( s.replaceAll("^(\\w+)(.*)(.*)$", "<$1><$2><$3>") ); This returned what I expected: "<Oh><(My god><>". You don't have to assign the groups explicit, or at least you do this wrong.Florin
C
0

As a simple general rule: leftmost quantifier wins. So as long as the following quantifiers identify purely optional subpatterns (regardless of them being made ungreedy), the first takes all.

Convexoconvex answered 2/4, 2010 at 11:3 Comment(3)
"the first takes all" -- I just added an example to show that this isn't always the case.Ellisellison
Only if you add backreferences to your pattern, as all rules fall in front of the higher need to make the pattern actually match. In your original message there were no backreferences, only quantifiers.Convexoconvex
You can see the revision history and note that I've always said that these capture groups are subpatterns to be used with backreferences later on, because otherwise it's not very useful, etc -- but you're right, it was my mistake not to include examples to make that even more explicit the first time.Ellisellison

© 2022 - 2024 — McMap. All rights reserved.