The EXPath W3C Community Group has a specification for an HTTP Client module, accessible from XPath and XQuery implementations, that performs "tidying" of HTML content. See http://expath.org/spec/http-client#d2e517 for the section of the specification that describes this:
If the media type is an HTML type, the content is tidied up and parsed (this process is implementation-dependent) and the item is the resulting document node.
Now, you might consider it a bit roundabout to bring HTTP into the question of querying HTML, but it is quite natural that one might want to query or traverse HTML documents retrieved via HTTP. It also conforms to the spirit here of being processor-agnostic.
The following code sample is standard XQuery that will work on any XQuery implementation that supports the EXPath HTTP Client. It demonstrates how one can retrieve an HTML5 document (here, the HTML5 specification itself, whose unclosed tags like <meta> make it not well-formed XML) and query it via an XPath expression:
    xquery version "3.1";

    declare namespace html = "http://www.w3.org/1999/xhtml";
    import module namespace http = "http://expath.org/ns/http-client";

    let $url := "https://www.w3.org/TR/html5/"
    return
        if (doc-available($url)) then
            "The URL was well-formed XML. No tidying required. :)"
        else
            let $response := http:send-request(<http:request href="{$url}" method="GET"/>)
            let $response-head := $response[1]
            let $response-body := $response[2]
            return
                if (
                    $response-head/http:body/@media-type eq "text/html"
                    and $response-body instance of document-node()
                ) then
                    "The URL was an HTML document that was tidied into a "
                    || "well-formed XML document. :) For example: "
                    || $response-body//html:meta => head() => serialize()
                else
                    "The HTTP Client wasn't able to parse the result "
                    || "into a well-formed XML document. :("
This returns:
    The URL was an HTML document that was tidied into a well-formed XML document. :)
    For example:
    <html:meta
        xmlns:html="http://www.w3.org/1999/xhtml"
        http-equiv="Content-Type"
        content="text/html; charset=utf-8"/>
Notice that this <meta> element is well-formed XML and was produced by the XPath expression //html:meta. (I tested this in eXist. The same code works in BaseX, except that the expression is //meta, since BaseX doesn't coerce the tidied HTML into the HTML namespace as eXist does.)
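One way to sidestep the namespace difference is a namespace wildcard in the name test. The following is only a sketch, which I have not run on both processors: it assumes the tidied body comes back as the second item of the response, as in the sample above, and `local:first-meta` is a hypothetical helper name of my own, not something defined by the EXPath specification.

```xquery
xquery version "3.1";

import module namespace http = "http://expath.org/ns/http-client";

(: Hypothetical helper, not part of the EXPath spec: returns the first
   meta element of a tidied response body, whatever namespace the
   processor put it in. Returns the empty sequence if the body was not
   parsed into a document node. :)
declare function local:first-meta($body as item()?) as element()? {
    if ($body instance of document-node()) then
        (: *:meta matches both {http://www.w3.org/1999/xhtml}meta
           and a no-namespace meta element :)
        head($body//*:meta)
    else
        ()
};

let $url := "https://www.w3.org/TR/html5/"
let $response := http:send-request(<http:request href="{$url}" method="GET"/>)
return local:first-meta($response[2]) ! serialize(.)
```

The `*:meta` name test is standard XPath 2.0 and later, so the query itself stays processor-agnostic; only the serialized result (with or without the `html:` prefix) should differ between implementations.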
I should add that the HTTP Client specification leaves it to processors to define "tidying", so there will surely be variation from one implementation to another. But if the question is "Can XPath and XQuery work on HTML documents?", this demonstrates that they can, and that they can do so using only processor-agnostic specifications, with the caveat shown here that different implementations might interpret the spec differently.