Python regex: Difference between (.+) and (.+?)

Asked 10/7, 2014 at 3:34 Answered 10/7, 2014 at 3:38

I am new to regex and Python's urllib. I went through an online tutorial on web scraping and it had the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of the (.+?) in my regex, but whoa was I wrong. I ended up printing way more html code than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain to me the difference between these two expressions and why it is grabbing so much html. Thanks!

ps. this is a starbucks stock quote scraper.

import urllib
import re

url = urllib.urlopen("http://finance.yahoo.com/q?s=SBUX")
htmltext = url.read()
regex = re.compile('<span id="yfs_l84_sbux">(.+?)</span>')
found = re.findall(regex, htmltext)

print found

Vicereine answered 10/7, 2014 at 3:34 Comment(3)

possible duplicate of Difference between .*? and .* for regex – Mediocrity 10/7, 2014 at 3:40

FYI, it is a bad idea to parse HTML with regex – Arezzini 10/7, 2014 at 3:40

Okay calm down bud, I'm just using this as a learning tool. I am new to regex and urllib and I thought this would be a nice sandbox exercise. – Vicereine 10/7, 2014 at 4:21

.+ is greedy -- it matches until it can't match any more and gives back only as much as needed.

.+? is not -- it stops at the first opportunity.

Examples:

Assume you have this HTML:

<span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>

This regex matches the whole thing:

<span id="yfs_l84_sbux">(.+)<\/span>

It goes all the way to the end, then "gives back" one </span>, but the rest of the regex matches that last </span>, so the complete regex matches the entire HTML chunk.

But this regex stops at the first </span>:

<span id="yfs_l84_sbux">(.+?)<\/span>
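
One way to check this yourself is to run both patterns over the sample HTML above. This is only a minimal sketch, assuming Python 3 and the standard re module; the variable names are illustrative and not from the original code.

```python
import re

# Sample HTML from the example above: two adjacent spans.
html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

# Greedy: (.+) runs to the end of the string, then backs up to the last </span>.
greedy = re.findall(r'<span id="yfs_l84_sbux">(.+)</span>', html)

# Lazy: (.+?) grows one character at a time and stops at the first </span>.
lazy = re.findall(r'<span id="yfs_l84_sbux">(.+?)</span>', html)

print(greedy)
print(lazy)
```

The greedy list holds a single string that spills into the second span, while the lazy list holds only the text of the first span.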

Justajustemilieu answered 10/7, 2014 at 3:38 Comment(2)

You are awesome! Easy to understand explanation. Maybe you could explain this to me as well. Why, in my case, does re.search return a strange looking string, while re.findall gives me that data I am actually wanting? – Vicereine 10/7, 2014 at 4:24

You might find this answer helpful. findall returns a list of the captured groups, while search returns a match object covering everything that matched. In your case, that match includes the HTML tags. I'm a bit of a beginner in Python, but I believe findall will also return the full matched text, tags and all, if you remove the parentheses from the regex. – Justajustemilieu 10/7, 2014 at 13:4
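
For anyone comparing re.search and re.findall, here is a rough sketch of what each one hands back. It assumes Python 3, and the input string is the "foo bar" sample from the answer rather than real page data.

```python
import re

html = '<span id="yfs_l84_sbux">foo bar</span>'
pattern = re.compile(r'<span id="yfs_l84_sbux">(.+?)</span>')

match = pattern.search(html)   # a match object, or None if nothing matched
print(match)                   # the object's repr, not the captured text
whole = match.group(0)         # everything that matched, tags included
inner = match.group(1)         # only what the parentheses captured
print(whole)
print(inner)

with_group = pattern.findall(html)   # list of the group-1 strings
print(with_group)

# Without a capturing group, findall returns the whole matched text instead.
no_group = re.compile(r'<span id="yfs_l84_sbux">.+?</span>')
without_group = no_group.findall(html)
print(without_group)
```

Printing the match object directly is probably where the strange looking string comes from; calling .group(1) on it gives the same text that findall returns.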

? is a non-greedy modifier. * by default is a greedy repetition operator - it will gobble up everything it can; when modified by ? it becomes non-greedy and will eat up only as much as will satisfy it.

Thus for

<span id="yfs_l84_sbux">want</span>text<span id="somethingelse">dontwant</span>

.*? will eat up want, then hit </span> - and this satisfies the regexp with minimal repetitions of ., resulting in want being the match. However, .* will try to see if it can eat more - it will go and find the other </span>, with .* matching want</span>text<span id="somethingelse">dontwant, resulting in what you got - much more than you wanted.
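
This explanation uses * while the question uses +, but the greedy and non-greedy behaviour is the same for both. A small sketch, assuming Python 3, that tries all four quantifiers on the sample string above (the results dictionary is just a helper for this illustration):

```python
import re

text = ('<span id="yfs_l84_sbux">want</span>text'
        '<span id="somethingelse">dontwant</span>')
prefix = r'<span id="yfs_l84_sbux">'

results = {}
for quantifier in ('*', '+', '*?', '+?'):
    # Build e.g. <span id="yfs_l84_sbux">(.*?)</span> for each quantifier.
    pattern = prefix + '(.' + quantifier + ')</span>'
    results[quantifier] = re.search(pattern, text).group(1)
    print(quantifier, repr(results[quantifier]))
```

The two greedy forms capture everything up to the last </span>, and the two non-greedy forms capture only want. The * and + versions differ only when the span is empty, since .* can match zero characters and .+ needs at least one.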

Glenine answered 10/7, 2014 at 3:36 Comment(2)

*? The OP actually uses +. – Donalt 10/7, 2014 at 3:41

Doh. Well, same reasoning. – Glenine 10/7, 2014 at 3:41

(.+) is greedy. It takes what it can and gives back when needed.

(.+?) is ungreedy. It takes as few as possible.

See:

delegate

[delegate] /^(.+)e/
[de]legate /^(.+?)e/

Also, comparing the "Regex debugger log" here and here will show you what the ungreedy modifier does more effectively.
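
The delegate example above can also be reproduced directly in Python. This is a minimal sketch assuming Python 3; group(0) is the full match (the bracketed part above) and group(1) is what the parentheses captured.

```python
import re

word = 'delegate'

greedy = re.match(r'^(.+)e', word)   # .+ grabs everything, then gives back one "e"
lazy = re.match(r'^(.+?)e', word)    # .+? takes one character, then finds an "e"

print(greedy.group(0), greedy.group(1))
print(lazy.group(0), lazy.group(1))
```

The greedy pattern matches the whole word with delegat in the group, while the ungreedy one matches only de with d in the group.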

Donalt answered 10/7, 2014 at 3:38 Comment(0)
