Can XPath and XQuery work on HTML documents?
D

5

6

I heard that an HTML document is not a XML document from https://stackoverflow.com/a/39560454.

XPath and XQuery work on XML documents. Can they work on HTML documents, and why?

Although I don't know why, I guess XPath can work on HTML documents, because of https://www.quora.com/Why-do-we-use-XPath-in-Selenium-even-though-CSS-Selector-is-faster and https://html-agility-pack.net/

Donets answered 23/4, 2019 at 22:18 Comment(0)
F
8

XQuery and XPath are defined to work on a particular data model called XDM. In XPath 1.0 this is described within the XPath specification; in XQuery and later XPath versions it is defined in a separate specification. XPath and XQuery can work on any data for which a mapping to XDM is defined. The XML and HTML DOM both differ in a number of details from XDM, but it is possible (with a bit of pragmatism) to define a mapping to XDM, and therefore XPath can be made to run against both XML and HTML DOMs. And indeed, both these mappings are very widely used, even though they are imperfect and in some cases inefficient.

The biggest problem with the HTML mapping to XDM is namespaces; XPath implementations traditionally regard HTML elements such as "table" and "p" as being in no namespace, so paths such as //table//p can be used, without namespace prefixes. But in HTML5, the WhatWG decided that these elements are in the XHTML namespace, which meant that they had to define a variation to the XPath spec to accommodate such paths.
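To make the namespace point concrete, here is a minimal sketch using Python's standard-library ElementTree, which supports only a small XPath subset and is not the XPath-for-HTML variation mentioned above. The sample document and the `h` prefix are assumptions chosen for illustration. Once the elements are in the XHTML namespace, the unprefixed path no longer matches:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# A tiny XHTML document; every element inherits the default XHTML namespace.
doc = ET.fromstring(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<table><tr><td><p>cell</p></td></tr></table>"
    "</body></html>"
)

# Unprefixed names mean "no namespace" here, so nothing matches.
unprefixed = doc.findall(".//table//p")

# Binding a prefix (the name "h" is arbitrary) to the XHTML namespace makes the path match.
prefixed = doc.findall(".//h:table//h:p", {"h": XHTML_NS})

print(len(unprefixed), len(prefixed))  # prints "0 1"
```

A processor that treats HTML elements as being in no namespace would accept the first path; one that puts them in the XHTML namespace needs the second form or some special-case rule.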

CSS selectors have slowly acquired much of the expressive power of XPath 1.0, though they are certainly not as rich as later versions, and since they are designed primarily for HTML rather than XML, they can sometimes be more convenient to use. I haven't seen any performance data, but the browser vendors have by necessity put a lot of effort into making CSS fast, and they seem to have done almost zero development on their XPath implementations in the last 15 years, so it certainly wouldn't surprise me if CSS is faster in most browsers. The differences between DOM and XDM also create overheads: notably the very inefficient representation of namespaces in DOM.

Finland answered 23/4, 2019 at 23:5 Comment(9)
Thanks. I guess you have already mentioned in your post. May I ask again: are XPath and XQuery the recommended tools to work on HTML documents, or are some alternative tools recommended? - Donets
Recommendations for choice of technology are off-topic for StackOverflow, for the very good reason that no professional consultant would advise on a choice without a much deeper study of your project requirements and constraints than you can give in a couple of paragraphs. Any recommendations you get here will be someone's personal opinion, and will generally be poorly-informed. - Finland
I don't have specific projects, I don't mind your recommendations, and recommendation is okay in comments, so please don't worry. Thanks in advance. - Donets
@Tim: The absence of specific requirements actually makes it harder, not easier, to make recommendations because it then becomes a burden on the responsible recommender to describe in general terms the space and the criteria for selecting among candidates. The context of comments clearly helps nothing in this regard, but nice try. - Agrippina
@MichaelKay: Thanks for taking the time to elaborate on XPath's data model co-evolution with XML and HTML DOM. - Agrippina
@Agrippina Thanks. Please don't worry either. Thanks in advance for any recommendation. Don't worry to be accurate, and I won't bite. I just wonder if it is worth to learn XPath and XQuery, or they are replaced by something else. Consider that XML has lost its application areas to JSON (and HTML maybe). - Donets
@Tim: JSON for data; XML for documents. But I'll follow you no further down that rabbit hole than the tool recommendation one (other than to say generally, yes, XPath is worth learning). - Agrippina
@Agrippina Is XQuery worth learning? If one only works on JSON and HTML, how transferable is the knowledge about XPath and XQuery? I am self learning a book named Database System Concepts, which has a chapter on XML, XPath and XQuery, and I am thinking if it is worth spending time on them. - Donets
@Donets we can't possibly know what technologies are in use in your neck of the woods or in any neck of the woods that your career will take you into in the future. So we can't offer career advice. Suffice it to say that despite XML not being as fashionable as it was, there's still a vast amount of the stuff flying around the ether. (In fact, a successful technology only really starts declining about 20 years after it's ceased to be sexy). - Finland
P
5

HTML doesn't guarantee well-formedness, so an XML parser will likely fail to parse it (unless you are using a very limited subset of HTML). However, XHTML is the well-formed cousin of HTML, and as far as I know works in browsers with the same feature-set (see: https://www.w3.org/TR/html-polyglot/).
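As a small illustration of that failure, here is a sketch using the XML parser in Python's standard library. The two fragments and the helper name `is_well_formed` are made up for this example. An unclosed `<br>` is fine in HTML but is rejected by an XML parser, while the XHTML spelling `<br/>` parses:

```python
import xml.etree.ElementTree as ET

html_fragment = "<p>Hello<br>world</p>"    # legal HTML, not well-formed XML
xhtml_fragment = "<p>Hello<br/>world</p>"  # same content, XHTML style

def is_well_formed(text):
    """Hypothetical helper: True if an XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed(html_fragment))   # prints "False"
print(is_well_formed(xhtml_fragment))  # prints "True"
```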

But if you already have HTML, then you will need to convert it into XML to use XPath/XQuery. There are various implementations of "HTML tidy" with the option to output well-formed XML (XHTML) that should work. Some form of tidy is probably available in your XQuery processor. If not, there are many languages and standalone implementations that can probably get you there.
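The conversion idea can be sketched without any tidy tool, using only Python's standard library. This is not HTML Tidy and not a full HTML5 parser: the class `TreeFromHTML`, the function `parse_html` and the sample markup are hypothetical names for illustration. It only copes with unclosed void elements such as `<meta>` and `<br>` and with unquoted attributes, not with implied end tags such as a missing `</p>`. It feeds the HTML parser's events into an XML-style tree and then runs a path over it (ElementTree implements only a limited XPath subset):

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML void elements never have an end tag, so we close them ourselves.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class TreeFromHTML(HTMLParser):
    """Hypothetical helper: turns HTML parser events into an ElementTree."""

    def __init__(self):
        super().__init__()
        self.builder = ET.TreeBuilder()

    def handle_starttag(self, tag, attrs):
        # Attributes without a value (e.g. "disabled") get an empty string.
        self.builder.start(tag, {k: (v or "") for k, v in attrs})
        if tag in VOID:
            self.builder.end(tag)

    def handle_endtag(self, tag):
        # Void elements were already closed in handle_starttag.
        if tag not in VOID:
            self.builder.end(tag)

    def handle_data(self, data):
        self.builder.data(data)

def parse_html(text):
    parser = TreeFromHTML()
    parser.feed(text)
    parser.close()                 # flush any buffered text
    return parser.builder.close()  # root element of the tree

html_text = ("<html><head><meta charset=utf-8><title>Demo</title></head>"
             "<body><table><tr><td><p>Hello<br>world</p></td></tr></table>"
             "</body></html>")

root = parse_html(html_text)
paragraphs = root.findall(".//table//p")
charset = root.find(".//meta").get("charset")
print(len(paragraphs), charset)  # prints "1 utf-8"
```

For real-world pages a proper tidy step or an HTML5-aware parser is the safer choice, since this sketch breaks as soon as the markup relies on implied end tags.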

Polyglot answered 23/4, 2019 at 22:31 Comment(2)
Thanks. (1) can every HTML document be converted to an equivalent XHTML document? (2) Although I don't know why, I guess XPath can work on HTML documents, because of quora.com/… and html-agility-pack.net. Is there really a need to convert HTML to XHTML in order to use XPath? - Donets
@Donets 1) I suspect there are exceptions, but I think generally, yes. 2) Web browsers parse HTML and build a DOM similar to XML, so you can XPath HTML in JavaScript, but not using XQuery or XSLT. Whether that will work for you depends on your application. - Polyglot
C
3

The EXPath W3C Community has a specification for an HTTP Client module accessible from XPath and XQuery implementations that performs "tidying" of HTML content. See http://expath.org/spec/http-client#d2e517 for the section of the specification that describes this:

If the media type is an HTML type, the content is tidied up and parsed (this process is implementation-dependent) and the item is the resulting document node.

Now, you might consider it a bit roundabout to bring HTTP into the question of querying HTML, but it is quite natural that one might want to query or traverse HTML documents retrieved via HTTP. It also conforms to the spirit here of being processor-agnostic.

The following code sample is standard XQuery that will work on any XPath or XQuery implementation that supports the EXPath HTTP Client. It demonstrates how one can retrieve an HTML5 document (here, the HTML5 specification itself, whose unclosed tags like <meta> make it non-well-formed XML) and query it via an XPath expression:

xquery version "3.1";

declare namespace html = "http://www.w3.org/1999/xhtml";

import module namespace http = "http://expath.org/ns/http-client";

let $url := "https://www.w3.org/TR/html5/"
return
    if (doc-available($url)) then 
        "The URL was well-formed XML. No tidying required. :)"
    else
        let $response := http:send-request(<http:request href="{$url}" method="GET"/>)
        let $response-head := $response[1]
        let $response-body := $response[2]
        return
            if (
                $response-head/http:body/@media-type eq "text/html" 
                and $response-body instance of document-node()
            ) then
                "The URL was an HTML document that was tidied into a " 
                || "well-formed XML document. :) For example: " 
                || $response-body//html:meta => head() => serialize() 
            else
                "The HTTP Client wasn't able to parse the result "
                || "into a well-formed XML document. :("

This returns:

The URL was an HTML document that was tidied into a well-formed XML document. :) 
For example: 
    <html:meta 
        xmlns:html="http://www.w3.org/1999/xhtml" 
        http-equiv="Content-Type" 
        content="text/html; charset=utf-8"/>

Notice that this <meta> element is well-formed XML and was produced by the XPath expression //html:meta. (I tested this in eXist. The same code works in BaseX, except that the expression is //meta, since BaseX doesn't coerce the tidied HTML into the HTML namespace as eXist does.)

I should add that the HTTP Client specification leaves it to processors to define "tidying", so there will surely be variation from one implementation to another. But if the question is "Can XPath and XQuery work on HTML documents?", this demonstrates that they can, and that they can do so using only processor-agnostic specifications, with the caveat, shown here, that different implementations might interpret the spec differently.

Canaveral answered 24/4, 2019 at 3:59 Comment(0)
P
2

When I wanted to use XPath (newer than XPath 1.0) on an HTML document, I wrote a complete XQuery interpreter for HTML.

Besides standard XQuery 3.0 I have added some optional extensions (that are not actually allowed, but useful for HTML) like matching node names case-insensitively or being more relaxed with namespaces.

Perron answered 25/4, 2019 at 13:36 Comment(2)
Thanks. May I ask advantages and disadvantages of using Xidel versus BaseX (maybe also versus Saxon)? - Donets
Xidel is much smaller than BaseX/Saxon and written in Pascal. So it needs less space. On small queries it should be faster, because it can finish the query evaluation before the Java VM for the others has started; but on longer running queries it becomes slower because it does not have many optimizations. - Perron
B
1

Indeed, XPath can be used against an HTML document. Some examples of packages/modules/applications that do this:

  • Selenium driver
  • lxml on python (based on libxml2)
  • xmllint on bash (based on libxml2)
Bluegrass answered 23/4, 2019 at 23:51 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.