How do you handle malformed HTML in Perl?
G

3

5

I'm interested in a parser that could take a malformed HTML page, and turn it into well formed HTML before performing some XPath queries on it. Do you know of any?

Geologize answered 27/10, 2009 at 20:55 Comment(4)
Depends on what you are trying to do. I routinely parse tens of gigabytes of garbled HTML source without worrying about any of that. – Knavery
How are you doing that? I tried to use XML::XPath in combination with LWP::UserAgent, and XML::XPath failed with a malformed error. Maybe you'd like to post your strategy as an answer. – Geologize
The answer depends on the specific task at hand. Your question is too vague to give a specific answer. First, however, don't try to parse HTML as XML. Use an HTML parser. – Knavery
You don't use XPath at all when parsing HTML? Don't you find it simpler to work with? – Geologize
B
13

You should not use an XML parser to parse HTML. Use an HTML parser.

Note that the following is perfectly valid HTML (and an XML parser would choke on it):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
    "http://www.w3.org/TR/html4/strict.dtd">

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Is this valid?</title>
</head>

<body>

<p>This is a paragraph

<table>

<tr>  <td>cell 1  <td>cell 2
<tr>  <td>cell 3  <td>cell 4

</table>

</body>

</html>

There are many task-specific HTML parsers on CPAN, in addition to the general-purpose ones. They have worked perfectly for me on an immense variety of extremely messy (and, most of the time, invalid) HTML.
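As one illustration of a task-specific parser, here is a minimal sketch that pulls link targets out of deliberately sloppy markup with HTML::LinkExtor (part of the HTML-Parser distribution). The sample markup and variable names are made up for the example; it is not the original poster's code.

```perl
#!/usr/bin/perl

use strict; use warnings;

use HTML::LinkExtor;

# Hypothetical input: unquoted attribute, unclosed <p> and <a> elements.
my $html = <<'HTML';
<p>See <a href=/docs/intro.html>the intro
<p>and <a href="/docs/faq.html">the FAQ</a>
<img src="logo.png">
HTML

# With no callback, the parser collects the links internally.
my $extor = HTML::LinkExtor->new;
$extor->parse($html);
$extor->eof;

# Each collected link is [ $tag, $attr_name => $url, ... ].
# Keep only the href values of <a> elements.
my @hrefs = map  { my ($tag, %attr) = @$_; $attr{href} }
            grep { $_->[0] eq 'a' } $extor->links;

print $_, "\n" for @hrefs;
```

No tree is built and no clean-up pass is needed here; the parser simply reports the link attributes it sees, which is often enough when the task is that narrow.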

It would be possible to give specific recommendations if you can specify the problem you are trying to solve.

There is also HTML::TreeBuilder::XPath, which uses HTML::Parser to parse the document into a tree and then allows you to query it using XPath. I have never used it, but see Randal Schwartz's "HTML Scraping with XPath".

Given the HTML file above, the following short script:

#!/usr/bin/perl

use strict; use warnings;

use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new;

$tree->parse_file("valid.html");
my @td = $tree->findnodes_as_strings('//td');

print $_, "\n" for @td;

outputs:

C:\Temp> z
cell 1
cell 2
cell 3
cell 4

The key point here is that the document was parsed by an HTML parser as an HTML document (despite the fact that we were able to query it using XPath).
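The same approach should carry over to input that is genuinely malformed rather than merely terse. The following is a sketch under assumptions: the markup is invented for the example, and as_XML is used only to show that the repaired tree can be written back out as well-formed markup if that is really needed.

```perl
#!/usr/bin/perl

use strict; use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical input: no <html> or <body>, unclosed <li> and <b>,
# unquoted attribute values.
my $html = <<'HTML';
<title>Broken page</title>
<ul>
<li class=item>first
<li class=item><b>second
</ul>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;    # signal end of input so the tree is completed

# Query the repaired tree with XPath.
my $title = $tree->findvalue('//title');
my @items = $tree->findnodes_as_strings('//li[@class="item"]');

print "title: $title\n";
print "item: $_\n" for @items;

# Serialize the repaired tree with every element properly closed.
print $tree->as_XML;

$tree->delete;    # free the tree explicitly
```

The XPath queries run against the tree that the HTML parser built, so there is no separate step of converting the page to well-formed markup first.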

Befit answered 27/10, 2009 at 22:16 Comment(0)
A
1

Unless you're looking to learn more about wheels, use the HTML Tidy code.

Anet answered 27/10, 2009 at 21:2 Comment(2)
With the plethora of task-specific parsers available to a Perl programmer, that is rarely necessary. – Knavery
It's been 5 years since I last worked with Perl... guess it's showing. – Anet
R
1

You could rephrase the question like this:

I'm interested in a parser that could take malformed C source, and turn it into well formed C source before performing compilation and linking on it. Do you know of any?

Now the question may be a bit more obvious: it's not going to be easy. If it's truly malformed HTML, you may need to do the work by hand until it can be fed into an HTML parser. Then, you can use any of the other modules presented here to do the work. It's unlikely, though, that you could ever programmatically translate raw HTML into strictly valid XHTML.

Respiratory answered 27/10, 2009 at 23:26 Comment(0)
