Parse an HTML fragment whitelisting some custom tags
I'm trying to parse an HTML fragment that contains a custom HTML tag using Nokogiri.

Example:

string = "<div>hello</div>\n<custom-tag></custom-tag>"

I tried loading it in several ways, but each has a drawback.

If I use Nokogiri::HTML:

doc = Nokogiri::HTML(string)

When I call to_html, it adds a doctype and wraps the content in an html tag, which I don't want.

If I use Nokogiri::XML:

doc = Nokogiri::XML(string)

I get "Error at line 2: Extra content at the end of the document", because XML requires a single root tag that wraps all of the document content. If I serialize the result, the output is <div>hello</div> (every tag after the first is removed).

I also tried Nokogiri::HTML.fragment:

doc = Nokogiri::HTML.fragment(string)

But it reports an error for custom-tag.

How can I make Nokogiri parse this HTML fragment correctly?

Arak answered 29/3, 2016 at 8:2 Comment(6)
What is your expected result? - Savagism
@AmitSharma I expect to parse the string as HTML with no errors, even if it contains a custom-tag. I need to run a few XPath queries, edit the content, and serialize it back to HTML without errors. - Arak
Have you tried doc = Nokogiri::HTML(string).inner_html? - Savagism
@RORDeveloper Try checking doc.errors. Should I just ignore them? How can I be sure that the content will be intact? @AmitSharma inner_html seems to work the same as to_html... - Arak
@Arak inner_html does not add a doctype, but it wraps the content in html. - Savagism
@AmitSharma ... yes, but I don't want the html tag either. That is not what I asked for. I want to parse the content and save it back without any change. - Arak

doc = Nokogiri::HTML.fragment(string) is the way to go. You can ignore the entries in doc.errors that complain about the invalid tag.

You are giving it invalid HTML, so you can't expect it not to report errors, but HTML parsers tend to be forgiving.

You can also use Nokogiri::XML.fragment if you're sure the rest of the content is well-formed XML. That won't give you errors about undefined tags.

Amphi answered 29/3, 2016 at 10:27 Comment(1)
HTML parsers in browsers aren't forgiving. They just try to be helpful and will rewrite the HTML until it's syntactically correct, even if the result is not what was originally intended. Nokogiri will also do some fixup. Knowing that doc.errors reports problems is important because tags can move during fixup. Letting Nokogiri parse the content as an XML fragment is probably the easiest option, since XML is more accepting of unknown tags, but XML has a rigid syntax that HTML doesn't, so it might not solve all the problems. - Syrupy
