Using regular expressions to parse HTML: why not?
Asked Answered

18

240

It seems like every question on Stack Overflow where the asker is using regex to grab some information from HTML will inevitably have an "answer" that says not to use regex to parse HTML.

Why not? I'm aware that there are quote-unquote "real" HTML parsers out there like Beautiful Soup, and I'm sure they're powerful and useful, but if you're just doing something simple, quick, or dirty, then why bother using something so complicated when a few regex statements will work just fine?

Moreover, is there just something fundamental that I don't understand about regex that makes them a bad choice for parsing in general?

Leckie answered 26/2, 2009 at 14:24 Comment(7)
I think this is a dupe of stackoverflow.com/questions/133601Adna
Because only Chuck Norris can parse HTML with regex (as explained in this famous Zalgo thing: #1732848).Banger
This question prompted me to ask another one which is somehow related. In case you are interested: Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's termsGrimaldo
Beware of ZalgoAssonance
This question has been added to the Stack Overflow Regular Expression FAQ, under "Common Validation Tasks".Sometimes
Canonical question: RegEx match open tags except XHTML self-contained tagsRama
Possible duplicate of Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's termsRuyle
237

Parsing HTML in its entirety is not possible with regular expressions, since correct parsing depends on matching arbitrarily nested opening and closing tags, which regular expressions cannot do.

Regular expressions can only match regular languages, but HTML is a context-free language, not a regular one. (As @StefanPochmann pointed out, regular languages are also context-free, so context-free does not by itself mean non-regular.) The only thing you can do with regexes on HTML is apply heuristics, and those will not work in every case. For any given regular expression, it should be possible to construct an HTML file that it matches wrongly.
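A tiny sketch of that last point in Python (the tag is contrived, but it is valid HTML): even a modest heuristic like "everything from <img up to the next >" is defeated by a legal attribute value.

```python
import re

# A perfectly valid tag whose attribute value happens to contain ">"
html = '<img alt="width > height" src="photo.jpg">'

# The obvious heuristic: "<img", then anything that isn't ">", then ">"
match = re.search(r'<img[^>]*>', html)

# The match stops at the ">" inside the quoted value, truncating the
# tag and losing the src attribute entirely
print(match.group())  # <img alt="width >
```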

Reece answered 26/2, 2009 at 14:32 Comment(12)
Best answer so far. If it can only match regular grammars then we would need an infinitely large regexp to parse a context-free grammar like HTML. I love when these things have clear theoretical answers.Leckie
I assumed we were discussing Perl-type regexes where they aren't actually regular expressions.Haye
What is it that makes Perl-type regular expressions not actual regular expressions?Leckie
ntownsend: They can refer to previously-matched parts later in the regexp, among other things. I'm not entirely sure WHERE they end up in the automaton hierarchy, though.Mckinzie
Actually, .NET regular expressions can match opening with closing tags, to some extent, using balancing groups and a carefully crafted expression. Containing all of that in a regexp is still crazy, of course; it would look like the great code Cthulhu and would probably summon the real one as well. And in the end it still won't work for all cases. They say that if you write a regular expression that can correctly parse any HTML, the universe will collapse onto itself.Bernardabernardi
Some regex libs can do recursive regular expressions (effectively making them non-regular expressions :)Otilia
-1 This answer draws the right conclusion ("It's a bad idea to parse HTML with Regex") from wrong arguments ("Because HTML isn't a regular language"). The thing that most people nowadays mean when they say "regex" (PCRE) is well capable not only of parsing context-free grammars (that's trivial actually), but also of context-sensitive grammars (see #7434772).Bronchus
@OndraŽižka right, there is an example of a "regExp" that can parse XML porg.es/blog/… (short summary is here research.swtch.com/irregexp)Mariannemariano
It's not "possible to present a [single, as the sentence implies] HTML file that will be matched wrongly by any regular expression". But any given regular expression (that has no false positives) can only recognize HTML up to a fixed maximum level of nesting (the maximum level is fixed if you only care about correct balancing and assume that tagnames are stored in constant space; otherwise the maximum level is still limited). So one can create a HTML document that makes a given regex fail, but not one that makes all regexes fail.Jugglery
The reason is simple: DFA's (deterministic finite automata, the materialization of regular expressions, and thus the automata that recognize regular languages) have a number of states fixed "at compile time". By contrast, Pushdown automata (the automata that recognize context-free languages) additionally employ a stack of unbounded size.Jugglery
@StefanPochmann You are right, my answer was not precise enough and didn't state the fact that HTML is not a regular language. I believe that is clear from the context, but nevertheless I should have mentioned it.Reece
@JohannesWeiß Yeah, it was fairly clear what you meant, but still better to explicitly say it. Thanks for the edit, though I rewrote it a bit to put the important part first and put the explanation only in the following parentheses. Hope that's ok. I think it reads better.Iolaiolande
37

For quick'n'dirty work a regexp will do fine. But the fundamental thing to know is that it is impossible to construct a regexp that will correctly parse HTML.

The reason is that regexps can't handle arbitrarily nested expressions. See Can regular expressions be used to match nested patterns?
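To make the nesting limitation concrete, here is a hedged sketch in Python (the pattern-building function is invented just for this demo): for any fixed depth you can build a regex that matches, but that same pattern always fails one level deeper.

```python
import re

def balanced(depth):
    # Build a pattern for <div> elements nested at most `depth` levels deep
    if depth == 0:
        return ''
    return '(?:<div>' + balanced(depth - 1) + '</div>)*'

pattern = re.compile(balanced(3))
doc = lambda n: '<div>' * n + '</div>' * n

print(bool(pattern.fullmatch(doc(3))))  # True: within the hard-coded depth
print(bool(pattern.fullmatch(doc(4))))  # False: one level too deep
```

No finite pattern of this kind covers all depths at once, which is the nested-expressions problem in miniature.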

Casuist answered 26/2, 2009 at 14:32 Comment(1)
Some regex libs can do recursive regular expressions (effectively making them non-regular expressions :)Otilia
33

(From http://htmlparsing.com/regexes)

Say you've got a file of HTML where you're trying to extract URLs from <img> tags.

<img src="http://example.com/whatever.jpg">

So you write a regex like this in Perl:

if ( $html =~ /<img src="(.+)"/ ) {
    $url = $1;
}

In this case, $url will indeed contain http://example.com/whatever.jpg. But what happens when you start getting HTML like this:

<img src='http://example.com/whatever.jpg'>

or

<img src=http://example.com/whatever.jpg>

or

<img border=0 src="http://example.com/whatever.jpg">

or

<img
    src="http://example.com/whatever.jpg">

or you start getting false positives from

<!-- // commented out
<img src="http://example.com/outdated.png">
-->

It looks so simple, and it might be simple for a single, unchanging file, but for anything that you're going to be doing on arbitrary HTML data, regexes are just a recipe for future heartache.
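For contrast, here is a sketch of the same extraction using Python's standard-library html.parser (the URLs are the made-up examples from above); a real parser copes with every variant, including the commented-out tag.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every real <img> tag."""
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.srcs.append(value)

html = '''
<img src="http://example.com/a.jpg">
<img src='http://example.com/b.jpg'>
<img src=http://example.com/c.jpg>
<img border=0 src="http://example.com/d.jpg">
<img
    src="http://example.com/e.jpg">
<!-- <img src="http://example.com/outdated.png"> -->
'''

p = ImgSrcCollector()
p.feed(html)
print(p.srcs)  # five URLs; the commented-out one is skipped
```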

Beekeeping answered 10/9, 2013 at 17:7 Comment(3)
This looks to be the real answer - while it's probably possible to parse arbitrary HTML with regex, since today's regexes are more than just finite automata, to parse arbitrary HTML (and not just one concrete page) you would have to reimplement an HTML parser in regexes, and the regexes would surely become a thousand times less readable.Wil
Hey Andy, I took the time to come up with an expression that supports your mentioned cases. https://mcmap.net/q/25803/-using-regular-expressions-to-parse-html-why-not Let me know what you think! :)Scrutator
The reasoning in this answer is way outdated, and applies even less today than it did originally (which I think it didn't). (Quoting OP: "if you're just doing something simple, quick, or dirty...".)Carolynecarolynn
17

Two quick reasons:

  • writing a regex that can stand up to malicious input is hard; way harder than using a prebuilt tool
  • writing a regex that can work with the ridiculous markup that you will inevitably be stuck with is hard; way harder than using a prebuilt tool

Regarding the suitability of regexes for parsing in general: they aren't suitable. Have you ever seen the sorts of regexes you would need to parse most languages?

Haye answered 26/2, 2009 at 14:29 Comment(3)
Wow? A downvote after 2+ years? In case anyone was wondering, I didn't say "Because it's theoretically impossible" because the question clearly asked about "quick-and-dirty", not "correct". The OP clearly already read answers that covered the theoretically impossible territory and still wasn't satisfied.Haye
Have an upvote after 5+ years. :) As for why you might have received the downvote, I'm not qualified to say, but personally, I would have liked to see some examples, or explanation rather than the closing rhetorical question.Raddy
Essentially all quick-and-dirty html parsing that is done in shipping products or internal tools ends up being a gaping security hole, or a bug waiting to happen. It must be discouraged with gusto. If one can use a regex, one can use a proper html parser.Mortality
17

As far as parsing goes, regular expressions can be useful in the "lexical analysis" (lexer) stage, where the input is broken down into tokens. They are less useful in the actual "build a parse tree" stage.

For an HTML parser, I'd expect it to accept only well-formed HTML, and that requires capabilities beyond what a regular expression can do (regular expressions cannot "count" and make sure that a given number of opening elements is balanced by the same number of closing elements).
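A rough sketch of that division of labor in Python (the token names and the toy grammar are invented for illustration): the regex does the lexing, while a stack, something a regular expression does not have, does the counting.

```python
import re

# Lexer stage: a regex is fine for splitting markup into tokens
TOKEN = re.compile(r"""
      (?P<close> </\s*(?P<close_name>\w+)\s*> )
    | (?P<open>  <(?P<open_name>\w+)[^>]*> )
    | (?P<text>  [^<]+ )
""", re.VERBOSE)

def tokenize(markup):
    tokens = []
    for m in TOKEN.finditer(markup):
        if m.group('close'):
            tokens.append(('CLOSE', m.group('close_name')))
        elif m.group('open'):
            tokens.append(('OPEN', m.group('open_name')))
        else:
            tokens.append(('TEXT', m.group('text')))
    return tokens

def well_formed(markup):
    # Parser stage: the "counting" a regex cannot do takes one stack
    stack = []
    for kind, value in tokenize(markup):
        if kind == 'OPEN':
            stack.append(value)
        elif kind == 'CLOSE':
            if not stack or stack.pop() != value:
                return False
    return not stack

print(tokenize('<p>hi <b>there</b></p>'))
print(well_formed('<p><b></b></p>'))  # True
print(well_formed('<p><b></p></b>'))  # False
```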

Mckinzie answered 26/2, 2009 at 14:34 Comment(0)
8

Because there are many ways to "screw up" HTML that browsers will still treat rather liberally. It would take quite some effort to reproduce that liberal browser behaviour and cover all the cases with regular expressions, so your regex will inevitably fail on some special cases, and that could introduce serious security gaps into your system.
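As a hedged illustration of that failure mode (the payload is contrived, and real browsers differ in how they reassemble such fragments):

```python
import re

def strip_scripts(html):
    # Naive sanitizer: delete every <script>...</script> pair it can see
    return re.sub(r'(?is)<script>.*?</script>', '', html)

# The fragments survive the sanitizer, then reassemble into a live tag
payload = '<scr<script></script>ipt>alert(1)</scr<script></script>ipt>'
print(strip_scripts(payload))  # <script>alert(1)</script>
```

The "sanitized" output is exactly the script tag the regex was supposed to remove.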

Eyeleteer answered 26/2, 2009 at 14:29 Comment(4)
Very true, the majority of HTML out there seems to be horrible. I don't understand how a failing regular expression can introduce serious security gaps. Can you give an example?Leckie
ntownsend: For instance, you think you have stripped all the script tags from the HTML, but your regex fails to cover a special case (that, let's say, only works on IE6): boom, you have an XSS vulnerability!Eyeleteer
This was a strictly hypothetical example since most real world examples are too complicated to fit into these comments but you could find a few by quick googling on the subject.Eyeleteer
+1 for mentioning the security angle. When you're interfacing with the entire internet you can't afford to write hacky "works most of the time" code.Moujik
8

The problem is that most users who ask a question involving HTML and regex do so because they can't get a regex of their own to work. Then one has to consider whether everything would be easier with a DOM or SAX parser or something similar, since those are optimized and constructed for the purpose of working with XML-like document structures.

Sure, there are problems that can be solved easily with regular expressions. But the emphasis lies on easily.

If you just want to find all URLs that look like http://.../ you're fine with regexps. But if you want to find all URLs that are in an <a> element with the class 'mylink', you are probably better off using an appropriate parser.
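A sketch of the second task with Python's standard-library parser (the class name and URLs are made up for the example):

```python
from html.parser import HTMLParser

class MyLinkCollector(HTMLParser):
    """Collect href values of <a> elements carrying the class 'mylink'."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        d = dict(attrs)
        if tag == 'a' and 'mylink' in d.get('class', '').split():
            self.urls.append(d.get('href'))

p = MyLinkCollector()
p.feed('<a class="mylink" href="http://a.example/">x</a>'
       '<a class="other mylink" href="http://b.example/">y</a>'
       '<a href="http://c.example/">z</a>')
print(p.urls)  # the first two hrefs only
```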

Resurge answered 26/2, 2009 at 14:30 Comment(0)
5

Regular expressions were not designed to handle a nested tag structure, and it is at best complicated (at worst, impossible) to handle all the possible edge cases you get with real HTML.

Shanaeshanahan answered 26/2, 2009 at 14:35 Comment(0)
5

I believe that the answer lies in computation theory. For a language to be parsed using regex, it must by definition be "regular" (link). HTML is not a regular language, as it does not meet a number of the criteria for a regular language (much of this has to do with the many levels of nesting inherent in HTML code). If you are interested in the theory of computation, I would recommend this book.

Age answered 26/2, 2009 at 14:36 Comment(1)
I've actually read that book. It just didn't occur to me that HTML is a context-free language.Leckie
4

HTML/XML is divided into markup and content. Regex is only useful for doing a lexical tag parse; I suppose you could deduce the content. It would be a good fit for a SAX parser: tags and content could be delivered to a user-defined function where the nesting/closure of elements can be kept track of.

As far as just parsing the tags goes, it can be done with regex and used to strip tags from a document.
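As a minimal sketch of that idea (naive on purpose: unlike the fuller pattern in this answer, it mishandles a > inside a quoted attribute value):

```python
import re

def strip_tags(html):
    # Remove every <...> run that contains no angle brackets inside it
    return re.sub(r'<[^<>]*>', '', html)

print(strip_tags('<p>Hello <b>world</b>!</p>'))  # Hello world!
```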

Over years of testing, I've found the secret to the way browsers parse tags, both well- and ill-formed.

The normal elements are parsed with this form:

The core of these tags uses this regex:

 (?:
      " [\S\s]*? " 
   |  ' [\S\s]*? ' 
   |  [^>]? 
 )+

You'll notice this [^>]? as one of the alternations. This will match unbalanced quotes from ill-formed tags.

It is also the single greatest root of all evil in these regular expressions. The way it's used will trigger a bump-along to satisfy its greedy, must-match quantified container.

If used passively, there is never a problem. But if you force something to match by interspersing it with a wanted attribute/value pair, and don't provide adequate protection from backtracking, it becomes an out-of-control nightmare.

This is the general form for just plain old tags. Notice the [\w:] representing the tag name? In reality, the legal characters for a tag name are an incredibly long list of Unicode characters.

 <     
 (?:
      [\w:]+ 
      \s+ 
      (?:
           " [\S\s]*? " 
        |  ' [\S\s]*? ' 
        |  [^>]? 
      )+
      \s* /?
 )
 >

Moving on, we also see that you can't search for a specific tag without parsing ALL tags. I mean, you could, but it would have to use a combination of verbs like (*SKIP)(*FAIL), and still all tags would have to be parsed.

The reason is that tag syntax may be hidden inside other tags, etc.

So, to passively parse all tags, a regex is needed like the one below. This particular one matches invisible content as well.

As HTML, XML, or anything else develops new constructs, just add them as one of the alternations.


Web page note - I've never seen a web page (or XHTML/XML) that this
had trouble with. If you find one, let me know.

Performance note - It's quick. This is the fastest tag parser I've seen
(there may be faster, who knows).
I have several specific versions. It is also excellent as a scraper
(if you're the hands-on type).


Complete raw regex

<(?:(?:(?:(script|style|object|embed|applet|noframes|noscript|noembed)(?:\s+(?>"[\S\s]*?"|'[\S\s]*?'|(?:(?!/>)[^>])?)+)?\s*>)[\S\s]*?</\1\s*(?=>))|(?:/?[\w:]+\s*/?)|(?:[\w:]+\s+(?:"[\S\s]*?"|'[\S\s]*?'|[^>]?)+\s*/?)|\?[\S\s]*?\?|(?:!(?:(?:DOCTYPE[\S\s]*?)|(?:\[CDATA\[[\S\s]*?\]\])|(?:--[\S\s]*?--)|(?:ATTLIST[\S\s]*?)|(?:ENTITY[\S\s]*?)|(?:ELEMENT[\S\s]*?))))>

Formatted look

 <
 (?:
      (?:
           (?:
                # Invisible content; end tag req'd
                (                             # (1 start)
                     script
                  |  style
                  |  object
                  |  embed
                  |  applet
                  |  noframes
                  |  noscript
                  |  noembed 
                )                             # (1 end)
                (?:
                     \s+ 
                     (?>
                          " [\S\s]*? "
                       |  ' [\S\s]*? '
                       |  (?:
                               (?! /> )
                               [^>] 
                          )?
                     )+
                )?
                \s* >
           )

           [\S\s]*? </ \1 \s* 
           (?= > )
      )

   |  (?: /? [\w:]+ \s* /? )
   |  (?:
           [\w:]+ 
           \s+ 
           (?:
                " [\S\s]*? " 
             |  ' [\S\s]*? ' 
             |  [^>]? 
           )+
           \s* /?
      )
   |  \? [\S\s]*? \?
   |  (?:
           !
           (?:
                (?: DOCTYPE [\S\s]*? )
             |  (?: \[CDATA\[ [\S\s]*? \]\] )
             |  (?: -- [\S\s]*? -- )
             |  (?: ATTLIST [\S\s]*? )
             |  (?: ENTITY [\S\s]*? )
             |  (?: ELEMENT [\S\s]*? )
           )
      )
 )
 >
Lauder answered 15/6, 2017 at 22:6 Comment(1)
Your regexp is malformed.Sining
3

There are definitely cases where using a regular expression to parse some information from HTML is the correct way to go - it depends a lot on the specific situation.

The consensus above is that in general it is a bad idea. However if the HTML structure is known (and unlikely to change) then it is still a valid approach.

Laliberte answered 29/4, 2011 at 6:45 Comment(0)
3

This expression retrieves attributes from HTML elements. It supports:

  • unquoted / quoted attributes,
  • single / double quotes,
  • escaped quotes inside attributes,
  • spaces around equals signs,
  • any number of attributes,
  • matching attributes only inside tags,
  • skipping comments, and
  • handling different quotes within an attribute value.

(?:\<\!\-\-(?:(?!\-\-\>)\r\n?|\n|.)*?-\-\>)|(?:<(\S+)\s+(?=.*>)|(?<=[=\s])\G)(?:((?:(?!\s|=).)*)\s*?=\s*?[\"']?((?:(?<=\")(?:(?<=\\)\"|[^\"])*|(?<=')(?:(?<=\\)'|[^'])*)|(?:(?!\"|')(?:(?!\/>|>|\s).)+))[\"']?\s*)

Check it out. It works better with the "gisx" flags, as in the demo.

Scrutator answered 17/10, 2016 at 21:19 Comment(2)
That's very interesting. Not readable, probably hard to debug, but still: impressive work!Fairfield
This still vaguely assumes that the HTML is well-formed. Without context matching, this will match apparent URLs in contexts where you typically don't want to match them, like in a piece of JavaScript code inside a <script> tag.Phebephedra
2

"It depends" though. It's true that regexes don't and can't parse HTML with true accuracy, for all the reasons given here. If, however, the consequences of getting it wrong (such as not handling nested tags) are minor, and if regexes are super-convenient in your environment (such as when you're hacking Perl), go ahead.

Suppose you're, oh, maybe parsing web pages that link to your site--perhaps you found them with a Google link search--and you want a quick way to get a general idea of the context surrounding your link. You're trying to run a little report that might alert you to link spam, something like that.

In that case, misparsing some of the documents isn't going to be a big deal. Nobody but you will see the mistakes, and if you're very lucky there will be few enough that you can follow up individually.

I guess I'm saying it's a tradeoff. Sometimes implementing or using a correct parser--as easy as that may be--might not be worth the trouble if accuracy isn't critical.

Just be careful with your assumptions. I can think of a few ways the regexp shortcut can backfire if you're trying to parse something that will be shown in public, for example.

Humanize answered 26/2, 2009 at 15:26 Comment(0)
2

Keep in mind that while HTML itself isn't regular, portions of a page you are looking at might be regular.

For example, it is an error for <form> tags to be nested; if the web page is working correctly, then using a regular expression to grab a <form> would be completely reasonable.

I recently did some web scraping using only Selenium and regular expressions. I got away with it because the data I wanted was put in a <form>, and put in a simple table format (so I could even count on <table>, <tr> and <td> to be non-nested--which is actually highly unusual). To some degree, regular expressions were almost necessary, because some of the structure I needed to access was delimited by comments. (Beautiful Soup can give you comments, but it would have been difficult to grab <!-- BEGIN --> and <!-- END --> blocks using Beautiful Soup.)

If I had to worry about nested tables, however, my approach simply would not have worked! I would have had to fall back on Beautiful Soup. Even then, however, sometimes you can use a regular expression to grab the chunk you need, and then drill down from there.
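That comment-delimited scraping is easy to sketch with a regex (the BEGIN/END markers and surrounding markup are hypothetical): grab the chunk first, then drill down from there.

```python
import re

html = '''
<p>navigation junk</p>
<!-- BEGIN -->
<table><tr><td>the data I want</td></tr></table>
<!-- END -->
<p>footer junk</p>
'''

# Grab only the delimited chunk; a real parser can take over from here
m = re.search(r'<!--\s*BEGIN\s*-->(.*?)<!--\s*END\s*-->', html, re.DOTALL)
chunk = m.group(1)
print(chunk.strip())
```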

Flyaway answered 12/2, 2013 at 18:34 Comment(0)
1

I tried my hand at a regex for this too. It's mostly useful for finding chunks of content paired with the next HTML tag. It doesn't look for matching close tags, but it will pick up close tags; roll a stack in your own language to check those.

Use with 'sx' options. 'g' too if you're feeling lucky:

(?P<content>.*?)                # Content up to next tag
(?P<markup>                     # Entire tag
  <!\[CDATA\[(?P<cdata>.+?)]]>| # <![CDATA[ ... ]]>
  <!--(?P<comment>.+?)-->|      # <!-- Comment -->
  </\s*(?P<close_tag>\w+)\s*>|  # </tag>
  <(?P<tag>\w+)                 # <tag ...
    (?P<attributes>
      (?P<attribute>\s+
# <snip>: Use this part to get the attributes out of 'attributes' group.
        (?P<attribute_name>\w+)
        (?:\s*=\s*
          (?P<attribute_value>
            [\w:/.\-]+|         # Unquoted
            (?=(?P<_v>          # Quoted
              (?P<_q>['\"]).*?(?<!\\)(?P=_q)))
            (?P=_v)
          ))?
# </snip>
      )*
    )\s*
  (?P<is_self_closing>/?)   # Self-closing indicator
  >)                        # End of tag

This one is designed for Python (it might work in other languages; I haven't tried it; it uses positive lookaheads, negative lookbehinds, and named backreferences). Supports:

  • Open Tag - <div ...>
  • Close Tag - </div>
  • Comment - <!-- ... -->
  • CDATA - <![CDATA[ ... ]]>
  • Self-Closing Tag - <div .../>
  • Optional Attribute Values - <input checked>
  • Unquoted / Quoted Attribute Values - <div style='...'>
  • Single / Double Quotes - <div style="...">
  • Escaped Quotes - <a title='John\'s Story'>
    (this isn't really valid HTML, but I'm a nice guy)
  • Spaces Around Equals Signs - <a href = '...'>
  • Named Captures For Interesting Bits

It's also pretty good about not triggering on malformed tags, like when you forget a < or >.

If your regex flavor supports repeated named captures then you're golden, but Python re doesn't (I know regex does, but I need to use vanilla Python). Here's what you get:

  • content - All of the content up to the next tag. You could leave this out.
  • markup - The entire tag with everything in it.
  • comment - If it's a comment, the comment contents.
  • cdata - If it's a <![CDATA[...]]>, the CDATA contents.
  • close_tag - If it's a close tag (</div>), the tag name.
  • tag - If it's an open tag (<div>), the tag name.
  • attributes - All attributes inside the tag. Use this to get all attributes if you don't get repeated groups.
  • attribute - Repeated, each attribute.
  • attribute_name - Repeated, each attribute name.
  • attribute_value - Repeated, each attribute value. This includes the quotes if it was quoted.
  • is_self_closing - This is / if it's a self-closing tag, otherwise nothing.
  • _q and _v - Ignore these; they're used internally for backreferences.

If your regex engine doesn't support repeated named captures, there's a section called out that you can use to get each attribute. Just run that regex on the attributes group to get each attribute, attribute_name and attribute_value out of it.
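As a hedged illustration, the <snip> section can indeed be run standalone; here it is in Python against a made-up attribute string (quoted values keep their quotes, a bare attribute yields None).

```python
import re

# The attribute sub-pattern from the <snip> section, stated standalone
ATTR = re.compile(r"""\s+
    (?P<attribute_name>\w+)
    (?:\s*=\s*
        (?P<attribute_value>
            [\w:/.\-]+                                          # unquoted
          | (?=(?P<_v>(?P<_q>['"]).*?(?<!\\)(?P=_q)))(?P=_v)    # quoted
        )
    )?
""", re.VERBOSE)

attrs = " href = \"http://example.com/x\" title='John\\'s tale' checked"
for m in ATTR.finditer(attrs):
    print(m.group('attribute_name'), m.group('attribute_value'))
```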

Demo here: https://regex101.com/r/mH8jSu/11

Albescent answered 28/12, 2016 at 5:5 Comment(0)
0

Regular expressions are not powerful enough for a language like HTML. Sure, there are some cases where you can use regular expressions, but in general they are not appropriate for parsing.

Rakia answered 26/2, 2009 at 14:33 Comment(0)
0

Actually, HTML parsing with regex is perfectly possible in PHP. You just have to parse the whole string backwards, using strrpos to find <, and repeat the regex from there, using ungreedy specifiers each time to get past nested tags. Not fancy, and terribly slow on large inputs, but I used it for my own personal template editor for my website. I wasn't actually parsing HTML, but a few custom tags I had made for querying database entries to display tables of data (my <#if()> tag could highlight special entries this way). I wasn't prepared to go for an XML parser over just a couple of self-created tags (with very non-XML data within them) here and there.

So, even though this question is considerably dead, it still shows up in Google searches. I read it, thought "challenge accepted", and finished fixing my simple code without having to replace everything. I decided to offer a different opinion to anyone searching for a similar reason. Also, the last answer was posted 4 hours ago, so this is still a hot topic.

Academia answered 12/2, 2013 at 22:56 Comment(4)
-1 for suggesting a TERRIBLE idea. Did you consider whitespace between the tag and the closing angle bracket? (E.g., <tag >) Did you consider commented-out closing tags? (E.g., <tag> <!-- </tag> -->) Did you consider CDATA? Did you consider inconsistent-case tags? (E.g., <Tag> </tAG>) Did you consider this as well?Prisoner
In the particular case of your few custom tags, yes, regular expressions work well. So it's not that your use of them was a mistake in your particular case. That's not HTML, though, and saying "HTML parsing with regex is perfectly possible in PHP" is just flat-out false, and a TERRIBLE idea. The inconsistencies of real HTML (and there are way more than the few I listed) are why you should never parse real HTML with regular expressions. See, well, all the other answers to this question, as well as the one I linked to in my other comment above.Prisoner
PHP is a Turing-complete language, so it's not flat-out false at all. Everything computationally possible is possible, including parsing HTML. Spaces in tags were NEVER a problem, and I've since adapted it to list tag elements in order. My use automatically corrected tags with inconsistent casing, stripped commented stuff at the very first stage, and after some later additions all sorts of tags can be easily added (though it's case-sensitive, by my own choice). And I'm pretty sure CDATA is actually an XML element, not an HTML one.Academia
My old method (that I described here) was pretty inefficient and I've recently started a re-write of a lot of the content editors. When it comes to doing these things, possibility isn't the issue; the best way is always the main concern. The real answer is "there's no EASY way to do it in PHP". NO ONE says there's no way to do it in PHP or that it's a terrible idea, but that it's impossible with regex, which I've honestly never tried, but the one major flaw in my answer is I assumed the question was referring to regex within the context of PHP, which is not necessarily the case.Academia
-1

You know... there's a lot of "you CAN'T do it" mentality here, and I think everyone on both sides of the fence is both right and wrong. You CAN do it, but it takes a little more processing than just running one regex against it. Take this (I wrote it inside of an hour) as an example. It assumes the HTML is completely valid, but depending on what language you're using to apply the aforementioned regex, you could do some fixing of the HTML first to make sure it will succeed. For example: removing closing tags that are not supposed to be there, such as </img>; then adding the self-closing forward slash to elements that are missing it, etc.

I'd use this in the context of writing a library that would allow me to perform HTML element retrieval akin to JavaScript's [x].getElementsByTagName(), for example. I'd just splice up the functionality I wrote in the DEFINE section of the regex and use it for stepping inside a tree of elements, one at a time.

So, will this be the final 100% answer for validating HTML? No. But it's a start, and with a little more work it can be done. However, trying to do it inside a single regex execution is neither practical nor efficient.

Forney answered 22/11, 2015 at 15:3 Comment(0)
