xmllint to parse a html file - McMap

Asked 8/3, 2017 at 19:18 Answered 8/3, 2017 at 19:27

bash macos xpath xmllint

I am trying to extract the text between specific tags in various HTML files on a Mac. I am looking for the first <H1> heading in the body. Example:

<BODY>
<H1>Dublin</H1>

I believe using regular expressions for this is an anti-pattern, so I used xmllint with an XPath expression instead.

xmllint --nowarning --xpath '/HTML/BODY/H1[0]'

The problem is that some of the HTML files contain badly formed tags, so I get errors along the lines of

 parser error : Opening and ending tag mismatch: UL line 261 and LI
</LI>

I can't simply add 2>/dev/null, because then I lose those files altogether. Is there any way to use an XPath expression here and tell xmllint to relax if the XML isn't perfect, and just give me the value between the first H1 tags?

Rina answered 8/3, 2017 at 19:18 Comment(0)

Try the --html option. Otherwise, xmllint parses your document as XML, which is a lot stricter than HTML. Also note that XPath indices are 1-based, so H1[0] matches nothing, and that the HTML parser converts element names to lowercase. The command

xmllint --html --xpath '/html/body/h1[1]' - <<EOF
<BODY>
<H1>Dublin</H1>
EOF

prints

<h1>Dublin</h1>
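
If only the text of the heading is wanted rather than the element markup, wrapping the path in the XPath string() function should return just the contents. A minimal sketch, assuming xmllint from libxml2 is on the PATH; the variable name heading is only for illustration:

```shell
# Print only the text of the first <h1> in <body>, not the element itself.
# string() is a standard XPath 1.0 function; the input is the same sample as above.
heading=$(xmllint --html --xpath 'string(/html/body/h1[1])' - 2>/dev/null <<'EOF'
<BODY>
<H1>Dublin</H1>
EOF
)
printf '%s\n' "$heading"
```

For this sample the captured value should be the bare word Dublin, without the surrounding tags.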

Cist answered 8/3, 2017 at 19:27 Comment(2)

I get even more mismatches when I do that. Instead of "./myfile.html:131: parser error : Opening and ending tag mismatch: UL line 127 and LI", I now get "HTML parser error : Opening and ending tag mismatch: ul and td". – Rina 8/3, 2017 at 19:40

@MoreThanFive libxml2's HTML parser isn't very forgiving. The --recover option might help in addition to --nowarning which you already discovered. – Cist 8/3, 2017 at 20:9
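
Putting the suggestions from this thread together, one possible approach is to parse in HTML mode with --recover and discard only the diagnostics on stderr, so the XPath result on stdout survives. This is a sketch under assumptions, not a tested recipe for every broken file: the helper name first_h1 and the sample file contents are made up for illustration, and how much libxml2 can recover depends on the actual markup.

```shell
# Hypothetical helper: print the text of the first <h1> in the body of one HTML file.
first_h1() {
    # --html       use libxml2's HTML parser instead of the strict XML parser
    # --recover    keep whatever tree could be built despite parse errors
    # --nowarning  suppress parser warnings
    # 2>/dev/null  hide the remaining diagnostics; the result still goes to stdout
    xmllint --html --recover --nowarning \
        --xpath 'string(/html/body/h1[1])' "$1" 2>/dev/null
}

# Demo on a deliberately malformed sample file (contents are illustrative only).
sample=$(mktemp "${TMPDIR:-/tmp}/sample.XXXXXX")
cat > "$sample" <<'EOF'
<HTML><BODY>
<H1>Dublin</H1>
<UL><LI>one</UL></LI>
</BODY></HTML>
EOF

result=$(first_h1 "$sample")
printf '%s\n' "$result"
rm -f "$sample"
```

To run it over many files, something like for f in *.html; do first_h1 "$f"; done should work, with the glob adjusted to wherever the files live.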
