Parsing of badly formatted HTML in PHP [closed]

In my code I convert some styled XLS documents to HTML using OpenOffice. I then parse the tables using xml_parser_create. The problem is that OpenOffice creates old-school HTML with unclosed <BR> and <HR> tags; it doesn't emit a doctype and doesn't quote attributes (<TABLE WIDTH=4>).

The PHP parsers I know of don't like this and yield XML formatting errors. My current solution is to run some regexes over the file before I parse it, but this is neither nice nor fast.

Do you know of a (hopefully bundled) PHP parser that doesn't care about these kinds of mistakes? Or perhaps a fast way to fix "broken" HTML?

Idiom answered 28/2, 2010 at 15:37 Comment(0)
L
9

A solution to "fix" broken HTML could be to use HTMLPurifier (quoting):

HTML Purifier is a standards-compliant HTML filter library written in PHP.
HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant


An alternative idea might be to try loading your HTML with DOMDocument::loadHTML (quoting):

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load.

And if you're trying to load HTML from a file, see DOMDocument::loadHTMLFile.
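
As a rough illustration of the DOMDocument route, here is a minimal sketch that pulls the table cells out of markup shaped like the OpenOffice output described in the question. The sample HTML and the helper name extractTableRows are made up for this example; only DOMDocument and the libxml_* functions are part of PHP itself.

```php
<?php
// Hypothetical sample imitating the OpenOffice output from the question:
// no doctype, unquoted attributes, an unclosed <BR>.
$html = '<HTML><BODY><TABLE WIDTH=4 BORDER=1>'
      . '<TR><TD>alpha<BR>beta</TD><TD>42</TD></TR>'
      . '<TR><TD>gamma</TD><TD>7</TD></TR>'
      . '</TABLE></BODY></HTML>';

// Illustrative helper (not a PHP built-in): returns one array of cell texts per row.
function extractTableRows(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    // The HTML parser lowercases element names, so 'tr' also matches <TR>.
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }
    return $rows;
}

$rows = extractTableRows($html);
print_r($rows);
```

Note that this builds the whole tree in memory, so for very large documents it will cost more RAM than a streaming parser.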

Lair answered 28/2, 2010 at 15:40 Comment(5)
+1 for introducing HTMLPurifier. One may look at simplehtmldom.sourceforge.net too. - Gisborne
The purifier is nice, but feels like overkill for the problem. The same goes for the DOM parser. Isn't it true that it will require a lot more time and RAM than a simple SAX parser? - Idiom
Maybe it will require more RAM, and possibly more time, but it will do more than a simple SAX parser, which only reads data and does not repair it. And I'd say a SAX parser will only be able to read valid XML, while HTMLPurifier and DOMDocument::loadHTML are both able to read "broken" HTML. - Lair
Because my errors are always generated by the same engine, and thus fairly predictable, I've coded the parser using simple regexes. I know about #1732848, and I am very thankful to you for pointing me to these two great tools. - Idiom
If you can "predict" the errors, I guess that's OK :-) You're welcome :-) - Lair
Z
4

There is SimpleHTML

For repairing broken HTML, you could use Tidy.
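
A minimal sketch of the Tidy route, assuming the tidy extension is installed (it is not enabled by default, so the code checks for it). The helper name repairToXhtml and the sample input are invented for this example; output-xhtml, numeric-entities and wrap are regular Tidy configuration options. The idea is to turn the OpenOffice markup into well-formed XHTML that an existing xml_parser_create based SAX parser could then consume.

```php
<?php
// Illustrative helper (not part of PHP): returns repaired XHTML,
// or null when the tidy extension is unavailable or the repair fails.
function repairToXhtml(string $html): ?string
{
    if (!extension_loaded('tidy')) {
        return null;
    }

    $config = [
        'output-xhtml'     => true, // close <br>/<hr>, quote attributes, lowercase tags
        'numeric-entities' => true, // e.g. &nbsp; becomes &#160;, which XML parsers accept
        'wrap'             => 0,    // do not re-wrap long lines
    ];

    $repaired = tidy_repair_string($html, $config, 'utf8');
    return $repaired === false ? null : $repaired;
}

// Hypothetical sample imitating the markup from the question.
$broken = '<TABLE WIDTH=4><TR><TD>alpha<BR>beta</TD></TR></TABLE><HR>';
$xhtml  = repairToXhtml($broken);

if ($xhtml === null) {
    echo "Tidy extension not available\n";
} else {
    echo $xhtml;
}
```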

As an alternative you can use the native XMLReader. Because it acts as a cursor moving forward over the document stream and stopping at each node on the way, it does not need the whole document to be well-formed before it starts; a parse error only surfaces once the cursor reaches the malformed part.

See http://www.ibm.com/developerworks/library/x-pullparsingphp.html
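
A small sketch of the pull-parsing style with XMLReader, under the assumption that the input is XML-like table markup. The helper name readCells and both sample strings are made up for this example. It collects the text of each cell while the cursor advances and reports how many libxml errors were recorded, so you can see what happens with an unclosed <BR>.

```php
<?php
// Illustrative helper (not part of PHP): walks the document with a forward-only
// cursor and returns the cell texts plus the number of libxml errors recorded.
function readCells(string $xml): array
{
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $reader = new XMLReader();
    $reader->XML($xml);

    $cells = [];
    // read() returns false at the end of the document or on a parse error;
    // @ hides the warning that some PHP versions emit in the error case.
    while (@$reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT
            && strtolower($reader->name) === 'td') {
            $cells[] = trim($reader->readString());
        }
    }

    $errorCount = count(libxml_get_errors());
    $reader->close();
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return ['cells' => $cells, 'errors' => $errorCount];
}

// Well-formed input: every cell is read and no errors are recorded.
$good = readCells('<table><tr><td>alpha</td><td>42</td></tr></table>');

// Unclosed <BR>, as in the question: libxml records at least one error.
$bad = readCells('<table><tr><td>alpha<BR>beta</td></tr></table>');

print_r($good);
print_r($bad);
```

How many nodes are delivered before the error depends on how much input libxml has already buffered, so on real OpenOffice output it is probably safer to repair the markup first and stream it afterwards.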

Zizith answered 28/2, 2010 at 15:40 Comment(1)
+1 for Tidy. I find it's more robust at its job than SimpleHTML. Two separate tools for two different jobs, really. - Sosanna
I
1

Any particular reason you're still using the PHP 4 XML API?

If you can get away with using PHP 5's XML API, there are two possibilities.

First, try the built-in HTML parser. It's really not very good (it tends to choke on poorly formatted HTML), but it might do the trick. Have a look at DOMDocument::loadHTML.

Second option - you could try the HTML parser based on the HTML5 parser specification:

http://code.google.com/p/html5lib/

This tends to work better than the built-in PHP HTML parser. It loads the HTML into a DOMDocument object.

Internecine answered 28/2, 2010 at 16:27 Comment(1)
I'd rather not use a DOM parser, as the document is quite big. (And I've already written tons of code for the SAX parser.) - Idiom
C
0

A solution is to use DOMDocument.

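Example:

// Deliberately malformed input: a stray </div> and a <p> closed by </i>.
$str = "
<html>
 <head>
  <title>test</title>
 </head>
 <body>
  </div>error.
  <p>another error</i>
 </body>
</html>
";

// Collect libxml parse warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($str);
libxml_clear_errors();

// saveHTML() serializes the repaired tree back to markup.
echo $doc->saveHTML();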

Advantage: the DOM extension is enabled by default in PHP, whereas the Tidy extension has to be enabled separately.

Coffeecolored answered 11/1, 2017 at 10:34 Comment(0)
