Can I have a non-greedy regex with dotall?

I would like to match dotall and non-greedy. This is what I have:

img(.*?)(onmouseover)+?(.*?)a

However, this is not being non-greedy. This data is not matching as I expected:

<img src="icon_siteItem.gif" alt="siteItem" title="A version of this resource is available on siteItem" border="0"></a><br><br></td><td rowspan="4" width="20"></td></tr><tr><td>An activity in which students find other more specific adjectives to 
describe a range of nouns, followed by writing a postcard to describe a 
nice holiday without using the word 'nice'.</td></tr><tr><td>From the resource collection: <a href="http://www.siteItem.co.uk/index.asp?CurrMenu=searchresults&amp;tag=326" title="Resources to help work">Drafting </a></td></tr><tr><td><abbr style="border-bottom:0px" title="Key Stage 3">thing</abbr> | <abbr style="border-bottom:0px" title="Key Stage 4">hello</abbr> | <abbr style="border-bottom:0px" title="Resources">Skills</abbr></td></tr></tbody></table></div></div></td></tr><tr><td><div style="padding-left: 30px"><div><table style="" bgcolor="#DFE7EE" border="0" cellpadding="0" cellspacing="5" width="100%"><tbody><tr valign="top"><td rowspan="4" width="60"><a href="javascript:requiresLevel0(350,350);"><img name="/attachments/3700.pdf" onmouseover="ChangeImageOnRollover(this,'/application/files/images/attach_icons/rollover_pdf.gif')" onmouseout="ChangeImageOnRollover(this,'/application/files/images/attach_icons/small_pdf.gif')" src="small_pdf.gif" alt="Download Recognising and avoiding ambiguity in PDF format" title="Download in PDF format" style="vertical-align: middle;" border="0"></a><br>790.0 k<br>

and I cannot understand why.

What I think I am stating in the above regex is:

start with "img", then allow 0 or more any character including new line, then look for at least 1 "onmouseover", then allow 0 or more any character including new line, then an "a"

Why doesn't this work as I expected?

KEY POINT: dotall must be enabled

Separates answered 29/2, 2012 at 22:41 Comment(6)
This seems to work perfectly. I get a match on img name="/attachments/3700.pdf" onmouseover="Cha - Intercessory
@jurgemaister do you have dotall enabled? - Separates
No, I don't. Guess I didn't read the question carefully enough. In that case it matches everything from the second character to the point I mentioned above, which would also be expected. - Intercessory
@jurgemaister we all make mistakes. I made one once! - Separates
Non-greedy matching means that it will stop at the first possible character. It does not mean that it will start at the last possible character, which you seem to be expecting. - Gelatinate
What are you actually trying to achieve with this? It'd be easier to suggest improvements if we knew the aim, and some example results... - Verso

It is being non-greedy. It is your understanding of non-greedy that is not correct.

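A regex engine will always try to find a match, and it takes the first one it can find scanning from the left. Non-greedy quantifiers only decide how little each repetition consumes once a starting position has been chosen; they do not move the start of the match to the right.

Let me show a simplified example of what non-greedy actually means (as suggested by a comment):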

re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)

This will match:

  • as few repetitions of 'a' as possible (in this case 2)
  • followed by a 'b'
  • and as few repetitions of 'c' as possible (in this case 0)

so the only match is 'aab'.
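To see this for yourself, here is a small sketch that runs the pattern above next to its greedy counterpart. The variable names are mine; the pattern and input are the ones from the example.

```python
import re

# The engine starts at index 0. a*? first tries zero 'a's, but 'b' cannot match
# there, so it grows to two 'a's. c*? then stops at zero because nothing after
# it in the pattern forces it to consume a 'c'.
lazy = re.findall(r'a*?bc*?', 'aabcc', re.DOTALL)

# Greedy version for contrast: c* takes every 'c' it can.
greedy = re.findall(r'a*bc*', 'aabcc', re.DOTALL)

print(lazy)    # ['aab']
print(greedy)  # ['aabcc']
```

The lazy pattern cannot return just 'b', because a match starting at index 0 exists and the leftmost match always wins.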

And just to conclude:

Don't use regex to parse HTML. There are libraries that were made for the job. re is not one of them.
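As a rough sketch of the library route, the standard library's html.parser can pick out the img tags that carry an onmouseover attribute. The class name RolloverImageFinder and the shortened sample markup are my own assumptions for illustration; third-party parsers such as Beautiful Soup would do the same job.

```python
from html.parser import HTMLParser


class RolloverImageFinder(HTMLParser):
    """Hypothetical helper: collects attributes of <img> tags with onmouseover."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        attr_map = dict(attrs)
        if tag == "img" and "onmouseover" in attr_map:
            self.found.append(attr_map)


# Shortened stand-in for the markup in the question.
sample = (
    '<img src="icon_siteItem.gif" alt="siteItem" border="0"></a><br>'
    '<a href="#"><img name="/attachments/3700.pdf" '
    'onmouseover="ChangeImageOnRollover(this,\'rollover_pdf.gif\')" '
    'src="small_pdf.gif" border="0"></a>'
)

finder = RolloverImageFinder()
finder.feed(sample)
finder.close()

names = [attrs.get("name") for attrs in finder.found]
print(names)  # ['/attachments/3700.pdf']
```

Only the second image is reported, since the parser looks at each tag's own attributes instead of scanning across tag boundaries.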

Abed answered 29/2, 2012 at 23:3 Comment(5)
as a simplified example, use re.findall(r'a*?bc*?', 'aabcc', re.DOTALL) - Edie
Why doesn't your example just return b? Why can it get away with matching 'c' zero times but it must match 'a' twice? - Rearm
Found an answer to my question: #16633815 - Rearm
plus1 for your conclusion "Don't use regex to parse html". I was exactly trying to do that. - Dian
You should probably add two examples for libraries that could be used to parse HTML. (I know that beautiful soup works well). - Tavares

First of all, your regex looks a little funky: you're saying match "img", then any number of characters, "onmouseover" at least once, but possibly repeated (e.g. "onmouseoveronmouseoveronmouseover"), followed by any number of characters, followed by "a".

This should match from img src="icon_ all the way to onmouseover="Cha. That's probably not what you want, but it's what you asked for.
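Here is a quick check of that claim, as a sketch on a shortened stand-in for the markup in the question (the sample string and variable names are mine). It also tries a narrower pattern that cannot cross a '>' character, purely to illustrate why the original starts at the first img; it assumes attribute values never contain '>' and is not a substitute for a real parser.

```python
import re

# Shortened stand-in for the question's data: two <img> tags, only the
# second one has an onmouseover attribute.
sample = (
    '<img src="icon_siteItem.gif" border="0"></a>\n'
    '<td>An activity</td>\n'
    '<a href="#"><img name="/attachments/3700.pdf" '
    'onmouseover="ChangeImageOnRollover(this)" src="small_pdf.gif"></a>'
)

# The pattern from the question, with dotall enabled.
original = re.compile(r'img(.*?)(onmouseover)+?(.*?)a', re.DOTALL)
match = original.search(sample)

first_img = sample.index('img')        # position of the first "img"
second_img = sample.index('img name')  # position of the second "img"

# The match begins at the FIRST img, and the lazy groups stretch until
# "onmouseover" and then the next "a" are reachable.
print(match.start() == first_img)                    # True
print(match.group(0).endswith('onmouseover="Cha'))   # True

# Illustration only: [^>]* cannot run past the end of a tag, so the first
# <img> can never reach an onmouseover that belongs to a later tag.
narrowed = re.search(r'<img[^>]*\bonmouseover\b[^>]*>', sample)
print(narrowed.start() == second_img - 1)            # True
```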

Second, and this is significantly more important:

DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.

And in case you didn't understand it the first time, let me repeat it in italics:

DON'T USE REGULAR EXPRESSIONS TO PARSE HTML.

Finally, let me link you to the canonical grimoire on the subject:

You can't parse [X]HTML with a regex

Tiga answered 29/2, 2012 at 23:20 Comment(1)
@tchirst: What you created in the post you linked is an HTML parser which uses regexes to build its lexer. It is clever, it's powerful. But what it isn't is a regular expression which describes HTML, because state (depth) has to be maintained separately. It's useful as a third-party parser library, but the whole point of discouraging beginners from parsing HTML with regexes is to encourage them to use tested, purpose-built parser libraries. HTML is far more complex than can be safely captured in a single-line regex written by a beginner. (Also, loved your books) - Tiga