Parse an HTML fragment whitelisting some custom tags

Asked 29/3, 2016 at 8:2 Answered 29/3, 2016 at 10:27

I'm trying to parse an HTML fragment that contains a custom HTML tag using Nokogiri.

Example:

string = "<div>hello</div>\n<custom-tag></custom-tag>"

I tried to load it in many ways, but none is optimal.

If I use Nokogiri::HTML:

doc = Nokogiri::HTML(string)

When I call to_html, it adds a doctype and an html tag that wraps the content, which I don't want.

If I use Nokogiri::XML:

doc = Nokogiri::XML(string)

I get the error "Extra content at the end of the document" at line 2, since an XML document must have a single root element that wraps all of its content. If I serialize the document again, the output is <div>hello</div> (every tag after the first is removed).

I also tried Nokogiri::HTML.fragment:

doc = Nokogiri::HTML.fragment(string)

But it complains about the custom-tag.

How can I make Nokogiri parse this HTML fragment correctly?

Arak answered 29/3, 2016 at 8:2 Comment(6)

what is your expected result? – Savagism 29/3, 2016 at 8:10

@AmitSharma I expect to parse the string with no errors in HTML, even if it contains a custom-tag. I need to make a few xpath queries, edit the content, and serialize back to html without errors. – Arak 29/3, 2016 at 8:13

Have you tried this doc = Nokogiri::HTML(string).inner_html ? – Savagism 29/3, 2016 at 8:13

@RORDeveloper Try to check doc.errors. Should I just ignore them? How can I be sure that the content will be intact? @AmitSharma inner_html seems to work the same as to_html... – Arak 29/3, 2016 at 8:16

@Arak inner_html does not add a doctype, but it still wraps the content in an html tag – Savagism 29/3, 2016 at 8:19

@AmitSharma ... yes, but I don't want the html tag as well. This is not my request. I want to parse the content and save it back without any change. – Arak 29/3, 2016 at 8:21

doc = Nokogiri::HTML.fragment(string) is the way to go. You can ignore the entries in doc.errors that complain about the invalid tag.

You are giving it invalid HTML, so you can't expect it to not report errors, but HTML parsers tend to be forgiving.

You can also use Nokogiri::XML.fragment, if you're sure the rest of it is well-formed. That won't give you errors about undefined tags.

Amphi answered 29/3, 2016 at 10:27 Comment(1)

HTML parsers in browsers aren't forgiving, they just try to be helpful and will rewrite the HTML until it's syntactically correct, even if it's not what the original intent was. Nokogiri will also do some fixup. Knowing that doc.errors says there are problems is important because tags can move during fixup. Allowing Nokogiri to parse the content as an XML fragment is probably the easiest since XML is more accepting of unknown tags, but it still has a rigid syntax that HTML doesn't so it might not solve all the problems. – Syrupy 30/3, 2016 at 16:23
