Is there an alternative to HTML Tidy?
F

4

23

I have embedded HTML Tidy in my application to clean incoming HTML. But Tidy has a huge number of bugs, and fixing them directly in the source is my worst nightmare. The Tidy source code is an unreadable abomination: thousand-plus-line functions, poor variable naming, spaghetti code, etc. It's truly horrible.

Worse yet, official development seems to have ceased. In the last 12 months, there have been three write transactions to the official CVS repo. But it's been dead and buried for much longer than that...

So I'm looking for an OSS C or C++ application/library that can do what Tidy can (when it feels like it): fix bad HTML markup and transform it into valid XHTML (this is the part I'm interested in). And I mean all sorts of bad markup.

Is there something like that out there?

EDIT: I need it both for manipulations on the DOM tree by an XML handling tool and for general compliance with the XHTML spec. My app needs to accept HTML from users (which is often invalid in all sorts of ways) and output valid XHTML. It needs to be able to handle even HTML that would normally not display in a browser because the user edited it by hand and didn't check afterwards.

A drop-in replacement for Tidy's error-correcting parser... that doesn't suck. I don't mind bugs if the source is readable and I can fix problems myself, or if there are active developers who provide bugfixes on a timely basis.

Festival answered 21/2, 2010 at 18:49 Comment(1)
Don't know if this is any use to you, but there's a Java library called TagSoup (home.ccil.org/~cowan/XML/tagsoup) which apparently has a couple of C++ ports, maybe, except one's not free and I'm not sure the other's maintained. It produces a stream of SAX events, but turning that into XML output should just be a matter of attaching the right pipe to the nozzle. Never used it myself, though. - Synthesize
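To make the "SAX events to XML output" step in the comment above concrete, here is a minimal sketch in plain C. It is not TagSoup's API or that of any of its ports: `XmlSink`, `on_start_element`, `on_end_element` and `on_characters` are hypothetical names, and the sketch assumes the parser hands over NUL-terminated strings with attributes as a NULL-terminated list of name/value pairs. It only shows that serializing already-balanced events is mostly a matter of escaping.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical output sink: a fixed-size, always NUL-terminated buffer. */
typedef struct {
    char   data[1024];
    size_t len;
} XmlSink;

static void sink_init(XmlSink *s) { s->len = 0; s->data[0] = '\0'; }

/* Append raw text; silently truncates once the buffer is full. */
static void sink_put(XmlSink *s, const char *text)
{
    size_t room = sizeof s->data - 1 - s->len;
    size_t n = strlen(text);
    if (n > room) n = room;
    memcpy(s->data + s->len, text, n);
    s->len += n;
    s->data[s->len] = '\0';
}

/* Append text with the XML special characters escaped. */
static void sink_put_escaped(XmlSink *s, const char *text)
{
    for (; *text; ++text) {
        switch (*text) {
        case '&': sink_put(s, "&amp;");  break;
        case '<': sink_put(s, "&lt;");   break;
        case '>': sink_put(s, "&gt;");   break;
        case '"': sink_put(s, "&quot;"); break;
        default: {
            char one[2] = { *text, '\0' };
            sink_put(s, one);
            break;
        }
        }
    }
}

/* SAX-style handlers. attrs is a NULL-terminated list of
   name, value, name, value, ... (or NULL for no attributes). */
static void on_start_element(XmlSink *s, const char *name, const char **attrs)
{
    sink_put(s, "<");
    sink_put(s, name);
    for (; attrs && attrs[0] && attrs[1]; attrs += 2) {
        sink_put(s, " ");
        sink_put(s, attrs[0]);
        sink_put(s, "=\"");
        sink_put_escaped(s, attrs[1]);
        sink_put(s, "\"");
    }
    sink_put(s, ">");
}

static void on_end_element(XmlSink *s, const char *name)
{
    sink_put(s, "</");
    sink_put(s, name);
    sink_put(s, ">");
}

static void on_characters(XmlSink *s, const char *text)
{
    sink_put_escaped(s, text);
}
```

A caller would run `sink_init` once, feed the handlers in event order, and read the result from `data`. The hard part, turning broken markup into a balanced event stream in the first place, is left to the parser and is not shown here.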
D
2

Could you tell us what you plan to use this tool for? As in, do you want to fix static web pages, or do you want some sort of filtering step before other manipulations, so that some tool can handle buggy web pages?

Personally, I write my own tool atop Python's BeautifulSoup or lxml whenever I need to; it's at most a dozen-line script and does much of what I want.

Dissipated answered 21/2, 2010 at 18:55 Comment(3)
I can't use Python or its libraries. This is a GUI, native code application. Integrating the Python interpreter is not an option. - Festival
Well, for a GUI native code app, technically integrating the Python interpreter is an option, but maybe not an appealing one when you evaluate the pros and cons. docs.python.org/extending/embedding.html - Midgard
Then I'd look at native bindings for lxml; it can do parsing quite well, even for horribly broken HTML. - Dissipated
A
2

There is a new, nice, proper Tidy with HTML5 support, so the alternative to the old, ugly Tidy would be Tidy itself (GitHub repository).

Ashia answered 29/9, 2015 at 18:52 Comment(0)
G
1

Try Pretty Diff. It is a vastly superior beautification algorithm and it does not make any assumptions about your input.

http://prettydiff.com/?m=beautify&html

Groan answered 10/12, 2011 at 12:49 Comment(1)
Disclose your affiliation.Ulda
T
-1

For something that actually fixes code, your best bet is still HTML Tidy. There are a lot of linters, but nothing other than Tidy that really repairs errors in HTML.

At first glance, modern OOP programmers might think that the source code is an unreadable abomination, but in the C world, Tidy is a pretty sophisticated library that uses a lot of advanced OO concepts and offers a very thoughtful interface that exposes nearly all of its functionality in a pure C API.

A casual developer will be lost, but once immersed, the code is quite beautiful. Granted, naming conventions are a mixed bag, but PRs are welcome!
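For readers unsure what "OO concepts in a pure C API" can look like, here is a small sketch of the opaque-handle pattern. It is not Tidy's actual API: `Cleaner`, `CleanerOption` and every function name below are hypothetical and only illustrate the general style of a hidden struct that is created, configured and released through functions.

```c
#include <stdlib.h>

/* In a real library only this forward declaration would live in the
   public header, so callers can never touch the fields directly. */
typedef struct Cleaner Cleaner;

/* Hypothetical option identifiers. */
typedef enum {
    CLEANER_OPT_XHTML_OUT,
    CLEANER_OPT_COUNT
} CleanerOption;

/* Private definition, normally kept in the .c file. */
struct Cleaner {
    int options[CLEANER_OPT_COUNT];
};

/* "Constructor": returns a zero-initialized object, or NULL on failure. */
Cleaner *cleaner_create(void)
{
    return calloc(1, sizeof(Cleaner));
}

/* "Destructor": safe to call with NULL. */
void cleaner_release(Cleaner *c)
{
    free(c);
}

/* "Setter": returns 0 on success, -1 on a bad handle or option. */
int cleaner_set_bool(Cleaner *c, CleanerOption opt, int value)
{
    if (c == NULL || opt < 0 || opt >= CLEANER_OPT_COUNT)
        return -1;
    c->options[opt] = (value != 0);
    return 0;
}

/* "Getter": returns the stored flag, or -1 on a bad handle or option. */
int cleaner_get_bool(const Cleaner *c, CleanerOption opt)
{
    if (c == NULL || opt < 0 || opt >= CLEANER_OPT_COUNT)
        return -1;
    return c->options[opt];
}
```

The point of the pattern is that the struct layout can change without breaking callers, which is roughly what encapsulation buys in an OOP language.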

Tzar answered 11/10, 2017 at 1:30 Comment(0)
