How can I prevent XML::XPath from fetching a DTD while processing an XML file?

My XML (a.xhtml) starts like this:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
...

My code starts like this:

use XML::XPath;
use XML::XPath::XMLParser;

my $xp = XML::XPath->new(filename => "a.xhtml");
my $nodeset = $xp->find('/html/body//table');

It's very slow, and it turns out that it spends a lot of time getting the DTD (http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd).

Is there a way to explicitly declare an HTTP proxy server in the Perl XML:: family? I would rather not modify the original a.xhtml document, for example by making it point to a local copy of the DTD.

Caraway answered 19/11, 2008 at 21:52 Comment(0)

XML::XPath is based on XML::Parser. XML::Parser has an option to NOT use LWP to resolve external entities (such as DTDs), and XML::XPath lets you pass in an XML::Parser object to use as the parser.

So you can write this:

use XML::Parser;
use XML::XPath;
my $p  = XML::Parser->new(NoLWP => 1);   # do not fetch external entities over HTTP
my $xp = XML::XPath->new(parser => $p, filename => "a.xhtml");

Note that in this case you will lose all entities except numeric ones and the five predefined ones (>, <, &, ' and "). The parser will not complain; the other entities will just disappear silently (try including &alpha; in the table and printing it, for example).

As a matter of fact you probably should not use XML::XPath, which is not actively maintained.

Try XML::LibXML if you have no problem with installing libxml2. Its interface is very similar to XML::XPath, as they both implement the DOM, and it is also much more powerful than XML::XPath, and faster to boot. If you want an expat/XML::Parser based module, then you might want to have a look at XML::Twig (that's blatant self-promotion as I am the author of the module, sorry). Also, for HTML/dodgy XHTML, you can use HTML::TreeBuilder, which, with the addition of HTML::TreeBuilder::XPath (also by me), supports XPath.
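For the XML::LibXML route, here is a minimal sketch of what the equivalent code might look like. The inline document is a made-up stand-in for a.xhtml, and the prefix "x" is an arbitrary name chosen for this example. The load_ext_dtd and no_network parser options are set explicitly so that the external DTD should not be fetched, whatever the defaults of your installed version are.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# Hypothetical inline sample standing in for a.xhtml
my $xhtml = <<'XHTML';
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>sample</title></head>
  <body><div><table><tr><td>cell</td></tr></table></div></body>
</html>
XHTML

# Parse without loading the external DTD and without any network access
my $doc = XML::LibXML->load_xml(
    string       => $xhtml,
    load_ext_dtd => 0,
    no_network   => 1,
);

# XHTML elements live in a namespace, so bind a prefix for use in XPath
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(x => 'http://www.w3.org/1999/xhtml');

# Same query as in the question, with the namespace prefix added
for my $table ($xpc->findnodes('/x:html/x:body//x:table')) {
    print $table->nodeName, "\n";
}
```

To parse the real file, pass location => "a.xhtml" instead of string => $xhtml. As with NoLWP, named entities that are only defined in the DTD presumably cannot be resolved this way, so a local catalog is still the better choice if the document uses them.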

Mossbunker answered 20/11, 2008 at 10:8 Comment(0)

porneL's response seems to be the Right Thing here. (www.w3.org has started taking 30 seconds to respond to each of my queries (when it doesn't just give up), and when XML::XPath ends up retrieving the full XHTML set…!) Further, mirod's idea works, too:

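use XML::Parser;
use XML::XPath;
use XML::Catalog;

# $xml is assumed to already hold the XHTML source as a string
my $parser = XML::Parser->new;
# Resolve external entities through the local catalog instead of the network
my $catalog_handler = XML::Catalog->new("xhtml1-20020801/DTD/xhtml.soc")->get_handler($parser);
$parser->setHandlers(ExternEnt => $catalog_handler);
my $xp = XML::XPath->new(xml => $xml, parser => $parser);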

Add a local copy of "The complete set of DTD files together with an XML declaration and SGML Open Catalog" from http://www.w3.org/TR/xhtml1/dtds.html, placed so that the path xhtml1-20020801/DTD/xhtml.soc resolves, and enjoy!

Luetic answered 8/3, 2011 at 2:26 Comment(0)

Usually it's done by setting up a local XML catalog.

libxml-based parsers support it, so if you follow mirod's advice, you'll be able to get named entities and validation to work without network access.

Fiance answered 19/11, 2008 at 22:22 Comment(1)
True. You could probably use XML::Catalog to add a catalog to an XML::Parser object, and use that parser in XML::XPath's new. I have never tested that though. - Mossbunker
