Cleaning HTML with Nokogiri (instead of Tidy)

The tidy gem is no longer maintained and has multiple memory leak issues.

Some people suggested using Nokogiri.

I'm currently cleaning the HTML using:

Nokogiri::HTML::DocumentFragment.parse(html).to_html

I've got two issues though:

  • Nokogiri removes the DOCTYPE

  • Is there an easy way to force the cleaned HTML to have html and body tags?

Fostoria answered 7/4, 2011 at 17:8 Comment(0)

If you are processing a full document, you want:

Nokogiri::HTML(html).to_html

That will force html and body tags, and introduce or preserve the DOCTYPE:

require 'nokogiri'

puts Nokogiri::HTML('<p>Hi!</p>').to_html
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
#=>  "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body><p>Hi!</p></body></html>

puts Nokogiri::HTML('<!DOCTYPE html><p>Hi!</p>').to_html
#=> <!DOCTYPE html>
#=> <html><body><p>Hi!</p></body></html>

Note that the output is not guaranteed to be valid against the DOCTYPE it declares. For example, if I provide a broken document that claims to be HTML 4.01 Strict, Nokogiri will output a document with that DOCTYPE but without the required <head><title>...</title></head> section:

dtd = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">'
puts Nokogiri::HTML("#{dtd}<p>Hi!</p>").to_html
#=> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
#=>  "http://www.w3.org/TR/html4/strict.dtd">
#=> <html><body><p>Hi!</p></body></html>
Hover answered 7/4, 2011 at 17:19 Comment(2)
Wow. I should have looked a bit harder. Thanks a bunch. :) On a side note, do you have any comments on how Nokogiri performs versus Tidy? Or maybe you have a suggestion of a Ruby lib that cleans HTML better? - Fostoria
@Christian I have no experience with Tidy, Rubyful Soup, Hpricot, or any other library for tidying HTML. Nokogiri is my toolkit of choice for all things HTML and XML, though I am almost always dealing with syntactically-valid markup. - Hover

The Tidy gem might not be supported, but the underlying tidy app is maintained, and that is what you really need. It's flexible and has quite a list of options.

You can pass HTML to it in many different ways, and either define its configuration in a .tidyrc file or pass the options on the command line. You could use Ruby's %x{} to pass it a file, or use IO.popen or IO.pipe to treat it as a pipe.
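
As a rough sketch of the pipe approach, the helper below writes a string to a command's standard input and reads back its standard output using IO.popen. `pipe_through` is a made-up name for illustration, and the commented-out tidy invocation assumes a tidy executable on your PATH that accepts the -quiet and -asxhtml options; check tidy's own help output for what your version supports.

```ruby
# Hypothetical helper: feed `input` to an external command and return its output.
def pipe_through(command, input)
  IO.popen(command, 'r+') do |io|
    io.write(input)
    io.close_write   # signal end of input so the command can finish
    io.read
  end
end

# Assumed invocation; requires the tidy executable to be installed.
# cleaned = pipe_through(['tidy', '-quiet', '-asxhtml'], '<p>Hi!</p>')
```

Writing everything before reading anything can stall on very large documents if the command's output buffer fills up; Open3.capture2 with its stdin_data option is one standard-library alternative that handles that case.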

Impeller answered 7/4, 2011 at 18:6 Comment(3)
Yeah, that was another option I had in mind, but I wanted to avoid creating a sub-process each time. - Fostoria
Then use IO.pipe and hold it open. You'd only have to do it one time for the entire app session. - Impeller
@Christian Joudrey I just found someone is working on a new tidy for Ruby: github.com/carld/tidy - Impeller
