Why is the minimal (non-greedy) match affected by the end of string character '$'?
Asked Answered
F

6

7

EDIT: remove original example because it provoked ancillary answers. also fixed the title.

The question is why the presence of the "$" in the regular expression effects the greedyness of the expression:

Here is a simpler example:

>>> import re
>>> str = "baaaaaaaa"
>>> m = re.search(r"a+$", str)
>>> m.group()
'aaaaaaaa'
>>> m = re.search(r"a+?$", str)
>>> m.group()
'aaaaaaaa'

The "?" seems to be doing nothing. Note the when the "$" is removed, however, then the "?" is respected:

>>> m = re.search(r"a+?", str)
>>> m.group()
'a'

EDIT: In other words, "a+?$" is matching ALL of the a's instead of just the last one, this is not what I expected. Here is the description of the regex "+?" from the python docs: "Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."

This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?

Furey answered 3/5, 2011 at 23:44 Comment(6)
Would you mind clarifying your question a little? I'm having trouble understanding exactly what it is you want. What do you mean by "the first match?" Are you talking about the .+?Portcullis
There may be a better way to do this with another library (in the context of paths) but this is fundamentally a question about regular expressions.Furey
What I mean by first match is the first search(), I'll edit.Furey
@Furey Alright, then. I said this in my answer below, but it's because you placed everything in the parentheses, so everything was placed in the group. There's nothing outside of the parentheses to match the rest of the string.Portcullis
@Portcullis i've added a second example that makes the issue clearer. your answer below does not address the issue.Furey
@Furey Edited my answer again.Portcullis
B
4

Matches are "ordered" by "left-most, then longest"; however "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being left-most is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa" because matching at the first A starts earlier in the string.

(Answer changed after OP clarification in comments. See history for previous text.)

Brunson answered 4/5, 2011 at 0:37 Comment(6)
@krumpelstiltskin: That was a copy-paste error. I originally used match instead of search, re-read your question, then added in 'c' and forgot to replace two of them.Brunson
I don't think this explains why the search() is not respecting the '?' in my example above.Furey
@krumpelstiltskin: It does, but maybe the new paragraph above will explain it better.Brunson
@Fred Nurk: but why doesn't it match the last "a" in my example? none of your examples correspond to this possiblity.Furey
@krumpelstiltskin: Ah, that clears things up. It just doesn't work that way. Left-most matches are considered more important than "longer" (which in this case means shorter, as you switch from greedy to non-greedy) matches.Brunson
@Fred Nurk: this is the answer I wanted. Could you put a link to the very nice document that @Alan Moore gives telling us that Python is a regex-directed language and that it's "eager" to make a match (left-most)?Furey
M
4

The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to add .+? to the beginning of your pattern.

Without the $, your pattern is allowed to be less greedy and stop sooner, because it doesn't have to match to the end of the string.

EDIT:

More details... In this case:

re.search(r"a+?$", "baaaaaaaa")

the regex engine will ignore everything up until the first 'a', because that's how re.search works. It will match the first a, and would "want" to return a match, except it doesn't match the pattern yet because it must reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it wouldn't check for the $ after each a, but only after it couldn't match any more a's.

But in this case:

re.search(r"a+?", "baaaaaaaa")

the regex engine will check if it has a complete match after eating the first match (because it's non-greedy) and succeed because there is no $ in this case.

Malayoindonesian answered 4/5, 2011 at 0:50 Comment(1)
This is the best answer I've seen in that it is a description of what is happening. It doesn't explain why it is happening though.Furey
F
4

The presence of the $ in the regular expression does not affect the greediness of the expression. It merely adds another condition which must be met for the overall match to succeed.

Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that's what it took to achieve a match.

With a+$ and a+?$, you've added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.

On the other hand, a+? initially consumes just the one a before handing off to $. That fails, so control is returned to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ does match the same number of a's as a+$, but it does so reluctantly, not greedily.

As for the leftmost-longest rule that was mentioned elsewhere, that never did apply to Perl-derived regex flavors like Python's. Even without reluctant quantifiers, they could always return a less-then-maximal match thanks to ordered alternation. I think Jan's got the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.

I believe the leftmost-longest rule only applies to POSIX NFA regexes, which use NFA engines under under the hood, but are required to return the same results a DFA (text-directed) regex would.

Fractocumulus answered 5/5, 2011 at 22:39 Comment(3)
That's it! The leftmost-longest rule is the reason. This answer is long-winded though. Is there any chance you could shorten it just to say that it matches the longest matching substring that is leftmost. Is there a link to some python documentation saying this?Furey
I found similar thread talking about this situation but the answer there has no link to the appropriate documentation: hereFurey
That's just it, though: a regex-directed engine like Python's doesn't hold out for the longest match. It takes the first match it finds, even if longer matches were available from the same starting point. And none of this is exclusive to Python. PHP, Java, .NET: all the popular Perl-derived flavors work the same way. Did you follow the second link in my answer? Jan does good job of explaining these issues here: regular-expressions.info/engine.htmlFractocumulus
P
1

There are two issues going on, here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without a parenthesized group. This behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.

>>> import re
>>> string = "baaa"
>>> 
>>> # Here you're searching for one or more `a`s until the end of the line.
>>> pattern = re.search(r"a+$", string)
>>> pattern.group()
'aaa'
>>> 
>>> # This means the same thing as above, since the presence of the `$`
>>> # cancels out any meaning that the `?` might have.
>>> pattern = re.search(r"a+?$", string)
>>> pattern.group()
'aaa'
>>> 
>>> # Here you remove the `$`, so it matches the least amount of `a` it can.
>>> pattern = re.search(r"a+?", string)
>>> pattern.group()
'a'

Bottom line is that the string a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you'll have a hard time getting the ? to mean anything at all, ever. In general, it's better to be explicit about what you're grouping with parentheses, anyway. Let me give you an example with explicit groups.

>>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
>>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
>>> pattern = re.search(r"(a+?)$", string)
>>> pattern.group(1)
'aaa'
>>>
>>> # In order to get the `?` to work, you need something else in your pattern
>>> # and outside your group that can be matched that will allow the selection
>>> # of `a`s to be lazy. # In this case, the `.*` is greedy and will gobble up
>>> # everything that the lazy `a+?` doesn't want to.
>>> pattern = re.search(r"(a+?).*$", string)
>>> pattern.group(1)
'a'

Edit: Removed text related to old versions of the question.

Portcullis answered 3/5, 2011 at 23:46 Comment(7)
well... i actually want just the beginning part, so I'm using re.sub() to get rid of the matching substring.Furey
did you see how the size of the match changes with the inclusion/exclusion of "$"? this is the curious point.Furey
@arussell84: see my comments to Fred Nurk.Furey
@Furey You mean you don't understand why the ? isn't respected in a+?$? It's because the $ is more important. Like I said, a+?$ means to match a's until the end of the line ($). Actually, in none of your examples do the ? even matter.Portcullis
@Portcullis it's clear that the '?' is not doing anything... in fact, my question is, why isn't it? see my edit in the question above for more detail and a quote from the python documentation.Furey
@Furey You have been given the answer to your question in many different forms on this page. If you don't understand it by now, I'm not sure how to help you. Do you need an example of where the ? does work or something?Portcullis
@Furey I've edited my answer again to remove stuff that pertained to the older versions of your question. I realized that I already had an example in there of how to get the ? to work. Basically, the ? inside the group is only lazy when it's allowed to be, and you need something else outside of your group to match the rest of the string that you don't want in your group.Portcullis
P
1

Answer to original question:

Why does the first search() span multiple "/"s rather than taking the shortest match?

A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.

Answer to revised question:

A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.

Another way of looking at it: A non-greedy subpattern will initially match the shortest possible match. However if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.

Philippine answered 4/5, 2011 at 0:29 Comment(5)
This answer is outdated since I simplified the example.Furey
@krumpelstiltskin: The only part that is outdated is the quote from your original question. The remainder essentially gives a correct answer to both questions -- however I will update it.Philippine
@Johh Machin: but the whole pattern does succeed "a+?$" on just the last character "a" of the string "baaaaaaaa".Furey
@krumpelstiltskin: Exactly. "This process continues until (False or the whole pattern matches)". I don't understand your "but".Philippine
you say that initially the shortest possible match will be performed, "However if this causes the whole pattern to fail, it will be retried with an extra character". In my example the shortest possible match does not cause the "whole pattern to fail": the whole pattern is "a+?$", which fits the substring "a" from str no problem.Furey
R
0

Unless your question isn't including some important information, you don't need, and shouldn't use, regex for this task.

>>> import os
>>> p = "/we/shant/see/this/butshouldseethis"
>>> os.path.basename(p)
butshouldseethis
Rother answered 3/5, 2011 at 23:49 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.