Why does a simple .*? non-greedy regex greedily include additional characters before a match?
Asked Answered
B

3

5

I have a very simple regex similar to this:

HOHO.*?_HO_

With this test string...

fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_fbguyev

  • I expect it to match just _HOHO___HO_ (shortest match, non-greedy)
  • Instead it matches _HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO___HO_ (longest match, looks greedy).

Why? How can I make it match the shortest match?

Adding and removing the ? gives the same result.

Edit - better test string that shows why [^HOHO] doesn't work: fiwgu_HOHO_HOHO_HOHOrgh_HOHO_feh_HOHO_H_O_H_O_HO_fbguye


All I can think of is that maybe it is matching multiple times - but there's only one match for _HO_, so I don't understand why it isn't taking the shortest match that ends at the _HO_, discarding the rest.

I've browsed all the questions I can find with titles like "Non-greedy regex acts greedy", but they all seem to have some other problem.

Boyes answered 9/12, 2014 at 18:15 Comment(1)
It matches the shortest possible path from the first occurrence of HOHO; and it can do this without backtracking, so it won't try to shorten that.Oestrin
B
11

I figured out a solution with some help from Regex lazy vs greedy confusion.

In regex engines like the one used by Javascript (NFA engines I believe), non-greedy only gives you the match that is shortest going left to right - from the first left-hand match that fits to the nearest right-hand match.

Where there are many left-hand matches for one right-hand match, it will always go from the first it reaches (which will actually give the longest match).

Essentially, it goes through the string one character at a time asking "Are there matches from this character? If so, match the shortest and finish. If no, move to next character, repeat". I expected it to be "Are there matches anywhere in this string? If so, match the shortest of all of them".


You can approximate a regex that is non-greedy in both directions by replacing the . with a negation meaning "not the left-side match". To negate a string like this requires negative lookaheads and non-capturing groups, but it's as simple as dropping the string into (?:(?!).). For example, (?:(?!HOHO).)

For example, the equivalent of HOHO.*?_HO_ which is non-greedy on the left and right would be:

HOHO(?:(?!HOHO).)*?_HO_

So the regex engine is essentially going through each character like this:

  • HOHO - Does this match the left side?
  • (?:(?!HOHO).)* - If so, can I reach the right-hand side without any repeats of the left side?
  • _HO_ - If so, grab everything until the right-hand match
  • ? modifier on * or + - If there are multiple right-hand matches, choose the nearest one
Boyes answered 9/12, 2014 at 18:15 Comment(4)
The explanation of the first part is good, your example to solve the current problem is wrong. You need to better understand the character class notion and why writing [^HOHO] has no sense.Fango
You're right, thanks, it's finding characters that aren't H or O rather than testing against the string, I need to do something using negative lookarounds similar to this #977751 - will come back and edit when I have timeBoyes
Okay, think I've fixed it nowBoyes
It took me white a while to understand what (?:(?!HOHO).)* does or rather how it works. For me this is the explanation what I found made it clear: "From the current cursor, are the next 4 characters "HOHO"? If YES, we're done with this part and moving on to the next part of the regex. If NO, then grab the character at the cursor, move the cursor past it, repeat."Belgae
S
5

Why it matches the whole string?

This is because regular-expression pattern matching is done by finding the first position in the string at which a match is possible. Since a match is possible starting at the first character of the string, shorter matches starting at subsequent characters are never even considered.

Example:
Let's consider a regular expression /a+?b/ and test string "aaaaaaaaab". When applied to the string it matches the whole string. Not just last a & b. This is because the first position in the string where a match is possible is at the first a.

So, if you want to match ab in aaaaaaaaab, use a negated character class based regex rather than a lazy dot:

a[^ab]*b

See the regex demo.

Source: Javascript: The Definitive Guide, Sixth Edition, Page Number: 255

Stick answered 10/12, 2014 at 11:6 Comment(2)
It would be nice to mention your sources as well.Oestrin
Kindly read How to reference material written by others. TL;DR: You need to link to the source and provide with the name of the author or organization (if there's no individual author). In this case it seems David Flanagan is the author of the book.Belgae
G
4

The result is non-greedy, because it's the shortest match from the first occurrence of HOHO until _HO_ is reached; the engine traverses the string from left to right and because it doesn't have to backtrack, it won't attempt to shorten anything.

To make it work in the way that's expected here, you need to have a greedy prefix in your expression:

/.*(HOHO.*?_HO_)/

The first memory capture contains the string that you're after; the greedy prefix will try to skip as many characters as possible, so it will match the last occurrence of HOHO first.

Gusman answered 10/12, 2014 at 11:3 Comment(2)
I've been reading blog.stevenlevithan.com/archives/greedy-lazy-performance to try to better understand backtracking. Would it be true that in this, it would match the entire string, then backtrack until it finds the string's last HOHO, then match forwards until it reaches _HO_? So in a string like HOHO_1_HO_|HOHO_2_HO_|HOHO_3_HO_ it would match HOHO_3_HO_ only? I'm also wondering if the example in my answer might be more efficient in case of very long strings?Boyes
@user568458 Right, that's what my answers suggests; i.e. it will match the last occurrence of "HOHO" first :) performance may be impacted, but only a benchmark will give some assurance.Oestrin

© 2022 - 2024 — McMap. All rights reserved.