Using xmllint and xpath with a less-than-perfect HTML document?

I have an HTML page that is generated by an existing tool - I cannot change the output of this tool.

However, I want to use xmllint with the --xpath option to pick out a few specific pieces of information from the downloaded webpage. The problem is that the page starts with:

<html lang=en><head>...

And xmllint throws errors nearly immediately:

html.out:2: parser error : AttValue: " or ' expected
<html lang=en><head>
           ^

The issue is evidently the missing quotation marks around the value of the lang attribute. The same kind of problem recurs sporadically throughout the page.

Nearly every browser can parse this just fine - how can I convince xmllint to do so as well? I would like to avoid having to inject an intermediate step to "fix" the file. Instead, I would like to either:

1) Find a flag, validation option, etc. that helps the parser along, or:

2) Use some other tool. (But what? xmllint is always my go-to for command line XPath commands.)

Using the standalone xpath tool instead fails in the same way:

> xpath html.out '//myquery...'

not well-formed (invalid token) at line 2, column 11, ...
Strophanthin answered 31/1, 2014 at 12:14 Comment(2)
XML is not HTML and vice versa. Beware! - Reeder
@StefanoSanfilippo This is definitely true! I'm reminded of this: https://mcmap.net/q/17499/-regex-match-open-tags-except-xhtml-self-contained-tags However, in this case I'm just looking for a one-liner that will work, not one that will stand the test of time and make my future children proud. - Strophanthin
24

You can enable the HTML parser in xmllint using the --html command line option. That way, you will be able to process HTML documents.
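
As a rough sketch of how that combines with --xpath (assuming xmllint from libxml2 is installed; the sample file and the queries are made up for illustration and stand in for html.out and the real query):

```shell
# Hypothetical sample with unquoted attribute values, standing in for html.out.
tmp=$(mktemp)
printf '<html lang=en><head><title>Example</title></head><body><p class=note>Hello</p></body></html>\n' > "$tmp"

# --html selects libxml2's lenient HTML parser instead of the XML parser.
# --xpath evaluates the expression; string(...) yields plain text rather
# than serialized nodes.
lang=$(xmllint --html --xpath 'string(/html/@lang)' "$tmp")
title=$(xmllint --html --xpath 'string(//title)' "$tmp")

echo "lang=$lang title=$title"
rm -f "$tmp"
```

Wrapping the query in string(...) is just one choice; a plain node-set query such as //title prints the matching elements as markup instead.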

Reeder answered 31/1, 2014 at 12:26 Comment(1)
This works only in a few cases. I wasn't able to find any other switch that makes xmllint tolerate non-standard HTML files the way a browser does. - Preform
9

If it does not abort the parsing, you can just hide the errors by appending:

2>/dev/null
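
Spelled out as a full command, this might look like the sketch below. It assumes xmllint is installed and is run with its HTML parser; the sample file and query are hypothetical. Parser messages go to stderr, while the XPath result goes to stdout, so only the messages are discarded.

```shell
# Hypothetical sample page; the real input would be html.out.
tmp=$(mktemp)
printf '<html lang=en><body><nav><a href=home.html>Home</a></nav></body></html>\n' > "$tmp"

# 2>/dev/null drops whatever the parser prints on stderr;
# the command substitution captures only the XPath result from stdout.
href=$(xmllint --html --xpath 'string(//a/@href)' "$tmp" 2>/dev/null)
status=$?   # exit status of xmllint, kept so a hard failure is still visible

echo "exit=$status href=$href"
rm -f "$tmp"
```

Keeping the exit status around is a precaution: with the messages hidden, it is the only remaining sign that the parse actually failed.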

Then there is Xidel, which I made just for picking data out of HTML pages. It is not perfect, though: I was told about two malformed documents it could not handle.

xidel html.out -e //yourquery...
Nuncia answered 31/1, 2014 at 12:33 Comment(0)
6

You should pre-process the HTML with a lenient parser. (That's the main difference: HTML allows a much laxer syntax than XML.) That is, try HTML5-Tidy and let xmllint work on the result:

input HTML
 |
 v
Tidy
 |
 v
xmllint
 |
 v
result
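
As a one-liner, the pipeline above could be sketched roughly like this. It assumes both tidy and xmllint are installed; the sample file and query are hypothetical. One thing to watch for: Tidy's XHTML output puts elements in the XHTML namespace, so a bare //title may match nothing, and the sketch sidesteps that with local-name().

```shell
# Hypothetical sample with an unquoted attribute and an unclosed <p>.
tmp=$(mktemp)
printf '<html lang=en><head><title>Example</title></head><body><p>Hello</body></html>\n' > "$tmp"

# tidy: -q quiet, -asxml emit well-formed XHTML, -numeric use numeric
# entities so xmllint needs no DTD to resolve named ones.
# Tidy exits non-zero on mere warnings, so only its output is used here.
# xmllint reads the cleaned document from stdin ("-").
title=$(tidy -q -asxml -numeric "$tmp" 2>/dev/null \
  | xmllint --xpath 'string(//*[local-name()="title"])' - 2>/dev/null)

echo "title=$title"
rm -f "$tmp"
```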
Grapher answered 31/1, 2014 at 12:26 Comment(2)
Would xmlstarlet work as well as xmllint for parsing the file Tidy creates? - Soutor
In principle, Tidy's result should be well-formed XML, so you can throw it at any XML processing tool chain and expect it to Just Work™. - Grapher