Python regex: Difference between (.+) and (.+?)
I am new to regex and Python's urllib. I went through an online tutorial on web scraping and it had the following code. After studying up on regular expressions, it seemed to me that I could use (.+) instead of the (.+?) in my regex, but whoa was I wrong. I ended up printing way more html code than I wanted. I thought I was getting the hang of regex, but now I am confused. Please explain to me the difference between these two expressions and why it is grabbing so much html. Thanks!

P.S. This is a Starbucks stock quote scraper.

import urllib
import re

url = urllib.urlopen("http://finance.yahoo.com/q?s=SBUX")
htmltext = url.read()
regex = re.compile('<span id="yfs_l84_sbux">(.+?)</span>')
found = re.findall(regex, htmltext)

print found

Vicereine answered 10/7, 2014 at 3:34 Comment(3)
possible duplicate of Difference between .*? and .* for regex - Mediocrity
FYI, it is a bad idea to parse HTML with regex. - Arezzini
Okay calm down bud, I'm just using this as a learning tool. I am new to regex and urllib and I thought this would be a nice sandbox exercise. - Vicereine

.+ is greedy -- it matches until it can't match any more and gives back only as much as needed.

.+? is not -- it stops at the first opportunity.

Examples:

Assume you have this HTML:

<span id="yfs_l84_sbux">foo bar</span><span id="yfs_l84_sbux2">foo bar</span>

This regex matches the whole thing:

<span id="yfs_l84_sbux">(.+)<\/span>

.+ first runs all the way to the end of the text, then backtracks ("gives back" characters) only until the rest of the regex can match. That happens at the last </span>, so the complete regex matches the entire HTML chunk.

But this regex stops at the first </span>:

<span id="yfs_l84_sbux">(.+?)<\/span>
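
A quick way to see this for yourself is to run both patterns over the sample HTML above. This is only a sketch in Python 3 syntax (the question's code is Python 2); the variable names are mine, and the unescaped / is fine because Python's re does not need it escaped.

```python
import re

# Sample HTML from the answer above, all on one line.
html = ('<span id="yfs_l84_sbux">foo bar</span>'
        '<span id="yfs_l84_sbux2">foo bar</span>')

greedy = re.compile(r'<span id="yfs_l84_sbux">(.+)</span>')
lazy = re.compile(r'<span id="yfs_l84_sbux">(.+?)</span>')

# findall returns the captured group for every match.
greedy_groups = greedy.findall(html)
lazy_groups = lazy.findall(html)

print(greedy_groups)  # one long string that runs up to the LAST </span>
print(lazy_groups)    # ['foo bar'], cut off at the FIRST </span>
```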
Justajustemilieu answered 10/7, 2014 at 3:38 Comment(2)
You are awesome! Easy to understand explanation. Maybe you could explain this to me as well. Why, in my case, does re.search return a strange looking string, while re.findall gives me the data I am actually wanting? - Vicereine
You might find this answer helpful. findall returns a list of the groups captured, while search returns a match object for everything that matched (the strange looking string is how that object prints). In your case, the full match includes the HTML tags. I'm a bit of a beginner in Python, but I believe that if you remove the parentheses from the regex, findall will also return the full matched text, the same text you get from the match object's group(). - Justajustemilieu
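
To make the search vs. findall difference concrete, here is a small sketch in Python 3 syntax. The HTML string and the 77.50 value are made up for illustration, not real page content.

```python
import re

# Made-up sample; the real page content will differ.
html = '<span id="yfs_l84_sbux">77.50</span>'
pattern = re.compile(r'<span id="yfs_l84_sbux">(.+?)</span>')

match = pattern.search(html)
print(match)              # a match object, not a plain string

whole = match.group(0)    # everything the regex matched, tags included
inner = match.group(1)    # only what the parentheses captured

# With one capturing group, findall returns just the captured text.
with_group = pattern.findall(html)

# Without parentheses, findall returns the full matched text instead.
without_group = re.findall(r'<span id="yfs_l84_sbux">.+?</span>', html)

print(whole, inner, with_group, without_group)
```

So if you want to keep using search, calling group(1) on the result should give you the same text that findall puts in its list.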

? is a non-greedy modifier. * by default is a greedy repetition operator - it will gobble up everything it can; when modified by ? it becomes non-greedy and will eat up only as much as will satisfy it.

Thus for

<span id="yfs_l84_sbux">want</span>text<span id="somethingelse">dontwant</span>

.*?</span> will eat up want, then hit </span> - and this satisfies the regexp with minimal repetitions of ., resulting in <span id="yfs_l84_sbux">want</span> being the match. However, greedy .* first eats as much as it can and backs off only as far as the last </span>, so .* ends up matching want</span>text<span id="somethingelse">dontwant, resulting in what you got - much more than you wanted.
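
Here is the same thing as a runnable sketch (Python 3 syntax, variable names are mine). The last part is an extra I added: the one place where * and + behave differently is an empty span, because + needs at least one character.

```python
import re

text = ('<span id="yfs_l84_sbux">want</span>text'
        '<span id="somethingelse">dontwant</span>')

# Greedy: .* backs off only as far as the last </span>.
greedy = re.search(r'<span id="yfs_l84_sbux">(.*)</span>', text).group(1)

# Non-greedy: .*? stops at the first </span>.
lazy = re.search(r'<span id="yfs_l84_sbux">(.*?)</span>', text).group(1)

print(greedy)
print(lazy)   # want

# Extra illustration: an empty span (hypothetical input).
empty = '<span id="yfs_l84_sbux"></span>'
star = re.search(r'<span id="yfs_l84_sbux">(.*?)</span>', empty)
plus = re.search(r'<span id="yfs_l84_sbux">(.+?)</span>', empty)

print(star.group(1) == '')  # True: * allows zero characters
print(plus)                 # None: + needs at least one character
```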

Glenine answered 10/7, 2014 at 3:36 Comment(2)
*? The OP actually uses +. - Donalt
Doh. Well, same reasoning. - Glenine

(.+) is greedy. It takes what it can and gives back when needed.

(.+?) is ungreedy. It takes as few as possible.

See (the brackets mark the text each pattern matches):

delegate

[delegate] /^(.+)e/
[de]legate /^(.+?)e/
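
If you want to check the delegate example in Python, something like this should do it (a sketch in Python 3 syntax; the slashes above are just regex delimiters and are not part of the pattern):

```python
import re

word = 'delegate'

greedy = re.match(r'^(.+)e', word)
lazy = re.match(r'^(.+?)e', word)

# group(0) is the whole match, group(1) is what (.+) or (.+?) captured.
greedy_whole, greedy_group = greedy.group(0), greedy.group(1)
lazy_whole, lazy_group = lazy.group(0), lazy.group(1)

print(greedy_whole, greedy_group)  # delegate delegat
print(lazy_whole, lazy_group)      # de d
```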

Also, comparing the "Regex debugger log" here and here will show you what the ungreedy modifier does more effectively.

Donalt answered 10/7, 2014 at 3:38 Comment(0)
