What is the difference between .*? and .* regular expressions?

Asked 19/6, 2010 at 10:16 Answered 19/6, 2010 at 12:52

229

I'm trying to split up a string into two parts using regex. The string is formatted as follows:

text to extract<number>

I've been using (.*?)< and <(.*?)> which work fine but after reading into regex a little, I've just started to wonder why I need the ? in the expressions. I've only done it like that after finding them through this site so I'm not exactly sure what the difference is.

Second answered 19/6, 2010 at 10:16 Comment(1)

see also #2301785 – Uvula 19/7, 2013 at 19:30

232

It is the difference between greedy and non-greedy quantifiers.

Consider the input 101000000000100.

Using 1.*1, * is greedy - it will match all the way to the end, and then backtrack until it can match 1, leaving you with 1010000000001.
.*? is non-greedy. * will match nothing, but then will try to match extra characters until it matches 1, eventually matching 101.

All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.

In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).

See Quantifier Cheat Sheet.

Subtractive answered 19/6, 2010 at 10:21 Comment(3)

Ah great I like that last one of anything but the > sign! – Second 19/6, 2010 at 13:58

Can you explain or show an example of how the greedy ? differs from the non-greedy ?? ? – Pigg 25/11, 2015 at 15:36

Sure. For the string "abc", the regex /\w\w?\w/ would match the full string "abc" - because ? is greedy. /\w\w??\w/ is lazy - it will only match "ab". It will only backtrack and match "abc" if it fails later. – Subtractive 26/11, 2015 at 16:34

299

On greedy vs non-greedy

Repetition in regex by default is greedy: they try to match as many reps as possible, and when this doesn't work and they have to backtrack, they try to match one fewer rep at a time, until a match of the whole pattern is found. As a result, when a match finally happens, a greedy repetition would match as many reps as possible.

The ? as a repetition quantifier changes this behavior into non-greedy, also called reluctant (in e.g. Java) (and sometimes "lazy"). In contrast, this repetition will first try to match as few reps as possible, and when this doesn't work and they have to backtrack, they start matching one more rept a time. As a result, when a match finally happens, a reluctant repetition would match as few reps as possible.

References

regular-expressions.info/Repetition - Laziness instead of Greediness

Example 1: From A to Z

Let's compare these two patterns: A.*Z and A.*?Z.

Given the following input:

eeeAiiZuuuuAoooZeeee

The patterns yield the following matches:

A.*Z yields 1 match: AiiZuuuuAoooZ (see on rubular.com)
A.*?Z yields 2 matches: AiiZ and AoooZ (see on rubular.com)

Let's first focus on what A.*Z does. When it matched the first A, the .*, being greedy, first tries to match as many . as possible.

eeeAiiZuuuuAoooZeeee
   \_______________/
    A.* matched, Z can't match

Since the Z doesn't match, the engine backtracks, and .* must then match one fewer .:

eeeAiiZuuuuAoooZeeee
   \______________/
    A.* matched, Z still can't match

This happens a few more times, until finally we come to this:

eeeAiiZuuuuAoooZeeee
   \__________/
    A.* matched, Z can now match

Now Z can match, so the overall pattern matches:

eeeAiiZuuuuAoooZeeee
   \___________/
    A.*Z matched

By contrast, the reluctant repetition in A.*?Z first matches as few . as possible, and then taking more . as necessary. This explains why it finds two matches in the input.

Here's a visual representation of what the two patterns matched:

eeeAiiZuuuuAoooZeeee
   \__/r   \___/r      r = reluctant
    \____g____/        g = greedy

Example: An alternative

In many applications, the two matches in the above input is what is desired, thus a reluctant .*? is used instead of the greedy .* to prevent overmatching. For this particular pattern, however, there is a better alternative, using negated character class.

The pattern A[^Z]*Z also finds the same two matches as the A.*?Z pattern for the above input (as seen on ideone.com). [^Z] is what is called a negated character class: it matches anything but Z.

The main difference between the two patterns is in performance: being more strict, the negated character class can only match one way for a given input. It doesn't matter if you use greedy or reluctant modifier for this pattern. In fact, in some flavors, you can do even better and use what is called possessive quantifier, which doesn't backtrack at all.

References

regular-expressions.info/Repetition - An Alternative to Laziness, Negated Character Classes and Possessive Quantifiers

Example 2: From A to ZZ

This example should be illustrative: it shows how the greedy, reluctant, and negated character class patterns match differently given the same input.

eeAiiZooAuuZZeeeZZfff

These are the matches for the above input:

A[^Z]*ZZ yields 1 match: AuuZZ (as seen on ideone.com)
A.*?ZZ yields 1 match: AiiZooAuuZZ (as seen on ideone.com)
A.*ZZ yields 1 match: AiiZooAuuZZeeeZZ (as seen on ideone.com)

Here's a visual representation of what they matched:

         ___n
        /   \              n = negated character class
eeAiiZooAuuZZeeeZZfff      r = reluctant
  \_________/r   /         g = greedy
   \____________/g

Related topics

These are links to questions and answers on stackoverflow that cover some topics that may be of interest.

One greedy repetition can outgreed another

Boutonniere answered 19/6, 2010 at 12:52 Comment(7)

I meant to say rubular.com, not ideone.com. To others: don't revise this post for me, I'll do it myself on next revision, along with other examples. Feel free to give feedback, suggestion, etc in comments so I can incorporate those as well. – Boutonniere 19/6, 2010 at 13:37

See also: #3145523 – Boutonniere 29/6, 2010 at 21:57

This answer has been added to the Stack Overflow Regular Expression FAQ, under "Quantifiers > More on the differences..." – Overtone 10/4, 2014 at 0:9

This answer really deserves to be the chosen answer!. Thank you so much for your detailed explanation. – Flower 3/10, 2019 at 13:49

I added the tag non-greedy. Why, because the question needed it, but also because it will funnel more users to this great answer. In other words if you give a great answer and the answer uses a tag not in the question, then add the tag because the OP did not know that the tag was revelent. – Anacoluthon 7/2, 2020 at 12:15

^festvox-[hikrdue][^n].*?[^8] refused to take out any instances with "8" in. this answer triggered thought process to use ^festvox-[hikrdue][^n][^8]* which still didn't work until i added "$" i.w. ^festvox-[hikrdue][^n].*?[^8] (comment written party as an Aide-mémoire for me) – Unrobe 3/12, 2020 at 13:32

One of the best answers. The reference provided is equally helpful in getting to the nuances of regex. Thanks. – Schaefer 9/11, 2022 at 6:20

232

It is the difference between greedy and non-greedy quantifiers.

Consider the input 101000000000100.

All quantifiers have a non-greedy mode: .*?, .+?, .{2,6}?, and even .??.

In your case, a similar pattern could be <([^>]*)> - matching anything but a greater-than sign (strictly speaking, it matches zero or more characters other than > in-between < and >).

See Quantifier Cheat Sheet.

Subtractive answered 19/6, 2010 at 10:21 Comment(3)

Ah great I like that last one of anything but the > sign! – Second 19/6, 2010 at 13:58

Can you explain or show an example of how the greedy ? differs from the non-greedy ?? ? – Pigg 25/11, 2015 at 15:36

Let's say you have:

<a></a>

<(.*)> would match a></a where as <(.*?)> would match a. The latter stops after the first match of >. It checks for one or 0 matches of .* followed by the next expression.

The first expression <(.*)> doesn't stop when matching the first >. It will continue until the last match of >.

Stumpy answered 19/6, 2010 at 10:20 Comment(2)

this is easier to understand than the above explanation. – Bragg 20/2, 2020 at 7:6

This is how explanations should be. – Uxoricide 10/8, 2020 at 23:0

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

On greedy vs non-greedy

References

Example 1: From A to Z

Example: An alternative

References

Example 2: From A to ZZ

Related topics

One greedy repetition can outgreed another

Recommended topics

Hot tags