Can you provide some examples of why it is hard to parse XML and HTML with a regex? [closed]
Asked Answered
G

12

414

One mistake I see people making over and over again is trying to parse XML or HTML with a regex. Here are a few of the reasons parsing XML and HTML is hard:

People want to treat a file as a sequence of lines, but this is valid:

<tag
attr="5"
/>

People want to treat < or <tag as the start of a tag, but stuff like this exists in the wild:

<img src="imgtag.gif" alt="<img>" />

People often want to match starting tags to ending tags, but XML and HTML allow tags to contain themselves (which traditional regexes cannot handle at all):

<span id="outer"><span id="inner">foo</span></span> 

People often want to match against the content of a document (such as the famous "find all phone numbers on a given page" problem), but the data may be marked up (even if it appears to be normal when viewed):

<span class="phonenum">(<span class="area code">703</span>)
<span class="prefix">348</span>-<span class="linenum">3020</span></span>

Comments may contain poorly formatted or incomplete tags:

<a href="foo">foo</a>
<!-- FIXME:
    <a href="
-->
<a href="bar">bar</a>

What other gotchas are you aware of?

Gutta answered 31/3, 2009 at 14:13 Comment(6)
Web browsers make sense of this kind of mess millions of times a second, can't somebody create a web page parser class for us mere mortals?Molest
Jon, they have. In Perl there are many HTML::Parser, HTML::TreeBuilder, etc. There is almost certainly one for your langauge.Gutta
Jon, what language are you looking for, and are you looking for parsing well-formed XML, or HTML tag soup that you get on the web?Southpaw
As a reference, see this for a Javascript parser. In most other languages, a parser is relatively easy to find.Diaphane
The best answer is, https://mcmap.net/q/17499/-regex-match-open-tags-except-xhtml-self-contained-tags (Beware Zalgo)Sonnier
Here’s a good explanation of how you certainly can parse HTML with patterns, as well as why you probably do not wish to do so.Amero
P
270

Here's some fun valid XML for you:

<!DOCTYPE x [ <!ENTITY y "a]>b"> ]>
<x>
    <a b="&y;>" />
    <![CDATA[[a>b <a>b <a]]>
    <?x <a> <!-- <b> ?> c --> d
</x>

And this little bundle of joy is valid HTML:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd" [
    <!ENTITY % e "href='hello'">
    <!ENTITY e "<a %e;>">
]>
    <title>x</TITLE>
</head>
    <p id  =  a:b center>
    <span / hello </span>
    &amp<br left>
    <!---- >t<!---> < -->
    &e link </a>
</body>

Not to mention all the browser-specific parsing for invalid constructs.

Good luck pitting regex against that!

EDIT (Jörg W Mittag): Here is another nice piece of well-formed, valid HTML 4.01:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  "http://www.w3.org/TR/html4/strict.dtd"> 
<HTML/
  <HEAD/
    <TITLE/>/
    <P/>
Parimutuel answered 31/3, 2009 at 17:47 Comment(9)
I have no idea what is going on in the first example, could you add some explanatory text?Gutta
The XML one? There are a few different constructs there, which is troublesome? The DTD internal subset? That's defining a new &entity; called ‘y’, containing a ‘]>’ sequence that would normally, if not in quotes, end the internal subset.Parimutuel
(This demonstrates that you have to have quite deep knowledge about some of the more esoteric and archaic DTD features of XML to parse a document properly, even if you aren't a DTD-validating parser.)Parimutuel
The HTML examples make use of a rarely known feature: shorttags. Read more at w3.org/QA/2007/10/shorttags.htmlBrigittebriley
Every time someone writes HTML as shown above Tim Berners-Lee sheds a single tear.Oligosaccharide
I love how Stackoverflow's Syntax highlighter fails on the 1st occurence of "]".Dorser
@Dorser because the Prettifyer is based on regex :P so is GeSHiMitchellmitchem
How on earth can the third example be valid? Even the angle brackets are unbalanced, let alone whole tags.Bobsled
@dorukayhan: HTML4 was formally based on SGML (even if browsers didn't parse it that way). SGML's rules are largely bonkers. Those unclosed elements are actually NETsParimutuel
P
73

Actually

<img src="imgtag.gif" alt="<img>" />

is not valid HTML, and is not valid XML either.

It is not valid XML because the '<' and '>' are not valid characters inside attribute strings. They need to be escaped using the corresponding XML entities &lt; and &gt;

It is not valid HTML either because the short closing form is not allowed in HTML (but is correct in XML and XHTML). The 'img' tag is also an implicitly closed tag as per the HTML 4.01 specification. This means that manually closing it is actually wrong, and is equivalent to closing any other tag twice.

The correct version in HTML is

<img src="imgtag.gif" alt="&lt;img&gt;">

and the correct version in XHTML and XML is

<img src="imgtag.gif" alt="&lt;img&gt;"/>

The following example you gave is also invalid

<
tag
attr="5"
/>

This is not valid HTML or XML either. The name of the tag must be right behind the '<', although the attributes and the closing '>' may be wherever they want. So the valid XML is actually

<tag
attr="5"
/>

And here's another funkier one: you can actually choose to use either " or ' as your attribute quoting character

<img src="image.gif" alt='This is single quoted AND valid!'>

All the other reasons that were posted are correct, but the biggest problem with parsing HTML is that people usually don't understand all the syntax rules correctly. The fact that your browser interprets your tagsoup as HTML doesn't means that you have actually written valid HTML.

Edit: And even stackoverflow.com agrees with me regarding the definition of valid and invalid. Your invalid XML/HTML is not highlighted, while my corrected version is.

Basically, XML is not made to be parsed with regexps. But there is also no reason to do so. There are many, many XML parsers for each and every language. You have the choice between SAX parsers, DOM parsers and Pull parsers. All of these are guaranteed to be much faster than parsing with a regexp and you may then use cool technologies like XPath or XSLT on the resulting DOM tree.

My reply is therefore: not only is parsing XML with regexps hard, but it is also a bad idea. Just use one of the millions of existing XML parsers, and take advantage of all the advanced features of XML.

HTML is just too hard to even try parsing on your own. First the legal syntax has many little subtleties that you may not be aware of, and second, HTML in the wild is just a huge stinking pile of (you get my drift). There are a variety of lax parser libraries that do a good job at handling HTML like tag soup, just use these.

Paganini answered 31/3, 2009 at 14:26 Comment(6)
You don't need to escape > as > though.Mccloud
Okay, s/valid/exists in the wild/gGutta
Actually, according to the specification you must escape > as > just as you must escape < as < & and &amp; and in attributes " as &quot; and ' as &apos; it's just that many parserPaganini
opps, forgot the finish my comment. It's just that many parsers will be able to recover from their error state if < is properly encoded. Again, not crashing your parser doesn't mean your XML is valid.Paganini
The specification does not say ‘>’ must be escaped — except for the special case of the sequence ‘]]>’ in content. For this reason it is easiest to always escape ‘>’, but it is not required by spec.Parimutuel
> sign is perfectly valid in html #95028Reifel
K
62

I wrote an entire blog entry on this subject: Regular Expression Limitations

The crux of the issue is that HTML and XML are recursive structures which require counting mechanisms in order to properly parse. A true regex is not capable of counting. You must have a context free grammar in order to count.

The previous paragraph comes with a slight caveat. Certain regex implementations now support the idea of recursion. However once you start adding recursion into your regex expressions, you are really stretching the boundaries and should consider a parser.

Kleist answered 31/3, 2009 at 14:18 Comment(0)
C
22

One gotcha not on your list is that attributes can appear in any order, so if your regex is looking for a link with the href "foo" and the class "bar", they can come in any order, and have any number of other things between them.

Coinage answered 1/4, 2009 at 5:58 Comment(1)
Ah, yes, that was even the question that prompted me to ask this one (the first link).Gutta
G
17

It depends on what you mean by "parsing". Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.

Glairy answered 31/3, 2009 at 14:16 Comment(3)
i guess backreferences can solve problem of open and close tagsUnhandy
@RishulMatta: how? You have only a limited number of backreferences and note you need to reverse the tags... Furthermore the strict definition of regexes does not allow backreferences.Argil
.NET allows for balancing expressions, which pop and push, and could theoretically be used for matching the hierarchy. But it's still a bad idea.Proudman
H
10

Are people actually making a mistake by using a regex, or is it simply good enough for the task they're trying to achieve?

I totally agree that parsing html and xml using a regex is not possible as other people have answered.

However, if your requirement is not to parse html/xml but to just get at one small bit of data in a "known good" bit of html / xml then maybe a regular expression or even an even simpler "substring" is good enough.

Hendon answered 31/3, 2009 at 14:29 Comment(6)
Define "good enough". Inevitably the simple regex won't work. Is not matching something or matching something you shouldn't a bug? If so then using regexes is a mistake. HTML and XML parsers are not hard to use. Avoiding learning them is a false economy.Gutta
ok, define "good enough". Lets say I have a webpage that tells me the clients IP address. That's all it does. Now, I need to write an application for the clients machine that tells me its IP address. I go to that site, look for an IP address and return it. Parsing the HTML is not needed!Hendon
If you have an arbitrary string whose format is completely under your control, the fact that the string happens to be well-formed XML really isn't relevant. But almost no use cases for XML actually fall into this category.Epaminondas
I can tell you from painful experience that most of the time it's possible to get what you want utilizing absurd complex regex patterns. Until the website undergoes a hilarious small change and you can throw this regex that made you cry for two days out of the window and start anew.Placeman
@Robert: "almost no use cases" is an exaggeration. In my experience there are common-enough use cases. YAGNI applies here... sometimes. The trick is knowing how bulletproof and long-lived your solution needs to be, for the particular task you're addressing. Robin has a good point. He is only saying that full XML parsing is not always worth it... which is true even if you know how to use it.Cartage
It's not an exaggeration in mine. In order to safely use regexes on XML, you either need to handle an amazing array of special cases (e.g. the arbitrary ordering of attributes, the proper use of delimiters for attribute values, places where whitespace does and does not need to be preserved, etc. ad inf.) or know for certain that those cases don't apply (e.g. know that your XML doesn't use attributes, or that whatever produces it always uses apostrophes to delimit attribute values, etc. ad inf.). And if you know that those things apply in your case, well, why were you using XML again?Epaminondas
V
7

I'm tempted to say "don't re-invent the wheel". Except that XML is a really, really complex format. So maybe I should say "don't reinvent the synchrotron."

Perhaps the correct cliche starts "when all you have is a hammer..." You know how to use regular expressions, regular expression are good at parsing, so why bother to learn an XML parsing library?

Because parsing XML is hard. Any effort you save by not having to learn to use an XML parsing library will be more than made up by the amount of creative work and bug-swatting you will have to do. For your own sake, google "XML library" and leverage somebody else's work.

Vicarious answered 14/7, 2012 at 23:52 Comment(3)
It's not as complex as C++ though.Mitchellmitchem
@Cole"Cole9"Johnson I wouldn't use REs to parse C++ either.Vicarious
If XML is a synchrotron, C++ would be the Large Hadron Collider.Sapphirine
C
6

People normally default to writing greedy patterns, often enough leading to an un-thought-through .* slurping large chunks of file into the largest possible <foo>.*</foo>.

Cleveland answered 31/3, 2009 at 14:20 Comment(1)
As well as making the repetition lazy with .*?<, you could fix that by using a negated character class like [^<]*<. (Disclaimer: obviously that’s still not foolproof, which is the point of the question.)Raindrop
G
6

I think the problems boil down to:

  1. The regex is almost invariably incorrect. There are legitimate inputs which it will fail to match correctly. If you work hard enough you can make it 99% correct, or 99.999%, but making it 100% correct is almost impossible, if only because of the weird things that XML allows by using entities.

  2. If the regex is incorrect, even for 0.00001% of inputs, then you have a security problem, because someone can discover the one input that will break your application.

  3. If the regex is correct enough to cover 99.99% of cases then it is going to be thoroughly unreadable and unmaintainable.

  4. It's very likely that a regex will perform very badly on moderate-sized input files. My very first encounter with XML was to replace a Perl script that (incorrectly) parsed incoming XML documents with a proper XML parser, and we not only replaced 300 lines of unreadable code with 100 lines that anyone could understand, but we improved user response time from 10 seconds to about 0.1 seconds.

Gadget answered 22/10, 2015 at 10:26 Comment(0)
D
5

I believe this classic has the information you are looking for. You can find the point in one of the comments there:

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

Some more info from Wikipedia: Chomsky Hierarchy

Deadly answered 12/1, 2012 at 22:47 Comment(1)
"Regular expression" doesn't have exactly the same meaning in formal grammar discussions as it does here. Most extant regex engines are more powerful than Chomsky Type 3 grammars (eg non-greedy matching, backrefs). Some regex engines (such as Perl's) are Turing complete. It's true that even those are poor tools for parsing HTML, but this oft-cited argument is not the reason why.Deon
F
1

I gave a simplified answer to this problem here. While it doesn't account for the 100% mark, I explain how it's possible if you're willing to do some pre-processing work.

Fordo answered 22/11, 2015 at 15:12 Comment(0)
B
0

Generally speaking, XML cannot be parsed using regex since XML grammar is by no means regular. To put it simply, regexes cannot count (well, Perl regexes might actually be able to count things) so you cannot balance open-close tags.

I disagree. If you will use recursive in regex, you can easily find open and close tags.

Here I showed example of regex to avoid parsing errors of examples in first message.

Beware answered 6/2, 2015 at 16:7 Comment(2)
First, recursive regexes aren't regular expressions (if you look in the parenthesis, you will see that I concede that Perl's regexes, which are recursive, can count things, which is required to handle HTML). Second, your example is for XHTML or XML that is well formed. HTML is not well formed. Third, you have to ask yourself, is it easier to extend and maintain a parser written in a recursive regex language or a general purpose programming language.Gutta
Fourth, even your example is trivially broken while still being valid XML. Add one space between content_block and id and it fails. I am certain if I spent a few more minutes I would find some other structural error in your code. It just isn't a good idea.Gutta

© 2022 - 2024 — McMap. All rights reserved.