C# version of HTML Tidy?
S

6

10

I am just looking for a really easy way to clean up some HTML (possibly with embedded JavaScript code). I tried two different HTML Tidy .NET ports and both are throwing exceptions...

Sorry, by "clean" I mean "indent". The HTML is not malformed, at all. It's XHTML strict.


I finally got something working with SGML, but this is seriously the most ridiculous chunk of code ever to indent some HTML.

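private static string FormatHtml(string input)
{
    // SgmlReader exposes the HTML as an XmlReader, so XmlTextWriter can re-emit it indented.
    var sgml = new SgmlReader {DocType = "HTML", InputStream = new StringReader(input)};
    using (var sw = new StringWriter())
    {
        using (var xw = new XmlTextWriter(sw) { Indentation = 2, Formatting = Formatting.Indented })
        {
            sgml.Read();
            while (!sgml.EOF)
                xw.WriteNode(sgml, true);
        }
        // Read the result only after the XmlTextWriter has been disposed (and flushed),
        // and while sw is still in scope.
        return sw.ToString();
    }
}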
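Since the input here is already well-formed XHTML, a plainer route might be to let LINQ to XML do the indenting. The following is only a sketch under that assumption: the class and method names are made up for illustration, and it will throw on anything that is not well-formed XML (including named entities such as `&nbsp;` that need the XHTML DTD to resolve).

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class XhtmlIndenter
{
    // Parses well-formed XHTML and re-serializes it with indentation.
    // Throws System.Xml.XmlException if the input is not well-formed XML.
    public static string Indent(string xhtml)
    {
        // LoadOptions.None drops insignificant whitespace while parsing,
        // so the default ToString() is free to re-indent the element tree.
        XDocument doc = XDocument.Parse(xhtml, LoadOptions.None);
        return doc.ToString();
    }
}
```

Usage would be something like `string pretty = XhtmlIndenter.Indent(generatedHtml);`. Elements with mixed text and child elements are left on one line by the serializer, so the result may differ from what HTML Tidy would produce.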
Semiology answered 23/10, 2010 at 3:37 Comment(4)
So you just want to reformat your source code? You can use any web-ide or Notepad++ for that.Pragmaticism
@Nick: I realize that, but I'm not trying to reformat HTML files I already have. I'm trying to reformat HTML that I'm generating in a C# app...Semiology
Check HtmlTextWriter; I updated my answer.Pragmaticism
Just to self-promote, my version from back in 2007 is over at The Code Project. Still using it in commercial projects.Britannia
T
10

The latest C# wrapper for HTML Tidy was written by Mark Beaton, and it seems rather more up to date than the links you've referenced (2003). Also worth noting is that Mark provides prebuilt binaries to reference, so you don't have to pull them from the official site. That should do the trick of nicely organising and validating your HTML.

Tapia answered 11/1, 2011 at 14:39 Comment(5)
The builds are just for tidylib, not the C# wrapper. You'll need to build TidyManaged from source as well. I'm running a 64-bit machine, but only the 32-bit tidylib dll works, for whatever reason. I had to put it in c:/windows/system. Also, the example Beaton provides won't indent your HTML -- the only thing I wanted -- you need to add doc.IndentBlockElements = AutoBool.Auto... little tricky to figure out.Semiology
Agreed, I've come rather unstuck after moving to x64: tidylib is throwing an exception "BadImageFormatException occurred - An attempt was made to load a program with an incorrect format. (Exception from HRESULT: 0x8007000B)". Posted a bug on TidyManaged: github.com/markbeaton/TidyManaged/issues/3 Tapia
I've managed to get this working on Windows 7 64 bit by changing the project to x86 in Configuration Manager on both the TidyManaged project and my project that references it and using the 32 bit version of libtidy.dll.Lattonia
I've just tried my 64-bit build of libtidy.dll under 64-bit Windows 7, and as long as the TidyManaged wrapper code is built using "Any CPU" and your referencing project is also "Any CPU", things work fine. If running under ASP.NET, you'll need to make sure your app pool is running in 64-bit mode as well ("Enable 32-bit applications" should be false). Also, you shouldn't need to drop libtidy.dll into your system directory - just putting it into your app's bin folder should be enough.Quechua
Also, I've just uploaded a release "Any CPU" build of the TidyManaged .NET wrapper library to GitHub: github.com/markbeaton/TidyManaged/downloadsQuechua
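Pulling the comments above together, a TidyManaged call that indents the output might look roughly like the sketch below. The `TidyIndenter` class and its method are hypothetical names; `IndentBlockElements = AutoBool.Auto` is the setting mentioned in the first comment, and the remaining properties follow the project's own sample. It assumes a libtidy.dll matching the process bitness is available (for example in the app's bin folder).

```csharp
using TidyManaged;

// Hypothetical helper wrapping TidyManaged; requires the native libtidy.dll at runtime.
public static class TidyIndenter
{
    public static string Indent(string html)
    {
        using (Document doc = Document.FromString(html))
        {
            doc.ShowWarnings = false;
            doc.Quiet = true;
            doc.OutputXhtml = true;
            // Without this setting the output is cleaned but not indented.
            doc.IndentBlockElements = AutoBool.Auto;
            doc.CleanAndRepair();
            return doc.Save();
        }
    }
}
```

If the call fails with BadImageFormatException, the bitness of the process and of libtidy.dll probably do not match, as discussed in the comments.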
L
18

AngleSharp is 100% C#:

    var parser = new HtmlParser();
    
    var document = parser.ParseDocument("<html><head></head><body><i></i></body></html>");

    var sw = new StringWriter();
    document.ToHtml(sw, new PrettyMarkupFormatter());

    var HTML_prettified = sw.ToString();

Edit by sebastian:

 // old parse method
 var document = parser.Parse("<html><head></head><body><i></i></body></html>");

 // new parse method (for AngleSharp 0.16.1); Code is the HTML string to format
 var document = await parser.ParseDocumentAsync(Code);
 
Landin answered 18/10, 2018 at 12:46 Comment(3)
Seems like a great project but it does not clean HTML. Would be nice if they offered some option to actually clean html, or better convert to XHTML, but it does not seem to target those scenarios.Cost
This should be the accepted answer. It's 100% C# and does what the OP asked for in 5 lines of code.Shellyshelman
Updated version (Nuget Package AngleSharp 0.16.1): ``` var parser = new HtmlParser(); var document = await parser.ParseDocumentAsync(Code); var sw = new StringWriter(); document.ToHtml(sw, new PrettyMarkupFormatter()); var HTML_prettified = sw.ToString(); ```Vengeance
P
2

UPDATE:

Check HtmlTextWriter or XhtmlTextWriter (usage: "Formatting Html Output with HtmlTextWriter"). Maybe constructing the HTML via HtmlTextWriter in the first place would be better?
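As a rough illustration of that idea, the sketch below builds a small fragment with HtmlTextWriter so that the indentation comes from the writer itself. It assumes a full .NET Framework project that references System.Web.dll (not the Client Profile); the class name and the two-space tab string are arbitrary choices for the example.

```csharp
using System.IO;
using System.Web.UI;

// Hypothetical example class; needs a reference to System.Web.dll.
public static class HtmlBuilderSketch
{
    public static string Build()
    {
        using (var sw = new StringWriter())
        {
            // The second constructor argument is the string used for one indent level.
            using (var writer = new HtmlTextWriter(sw, "  "))
            {
                writer.RenderBeginTag(HtmlTextWriterTag.Div);
                writer.RenderBeginTag(HtmlTextWriterTag.P);
                writer.Write("line 1");
                writer.RenderEndTag(); // closes <p>
                writer.RenderEndTag(); // closes <div>
            }
            return sw.ToString();
        }
    }
}
```

This only helps when you control how the HTML is generated; it does not reformat an existing HTML string.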

Also check: LINQ & Lambda, Part 3: Html Agility Pack to LINQ to XML Converter

There is also http://www.manoli.net/csharpformat/; the source code is available there, in case you missed it.


Maybe you want to do it yourself? This project can be helpful: Html Agility Pack

What is exactly the Html Agility Pack (HAP)?

This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).

Html Agility Pack now supports Linq to Objects (via a LINQ to Xml Like interface). Check out the new beta to play with this feature

Sample applications:

  • Page fixing or generation. You can fix a page the way you want, modify the DOM, add nodes, copy nodes, well... you name it.

  • Web scanners. You can easily get to img/src or a/hrefs with a bunch XPATH queries.

  • Web scrapers. You can easily scrap any existing web page into an RSS feed for example, with just an XSLT file serving as the binding. An example of this is provided.


Also you can try this implementation: A managed wrapper for the HTML Tidy library

Pragmaticism answered 23/10, 2010 at 3:40 Comment(8)
I've heard of and have used HtmlAgilityPack a lot in the past..but can it tidy up HTML?Semiology
HAP is not a replacement for Tidy; rather, it builds a DOM for you that you can process as needed. Also, I'm not sure whether it is smart enough to parse malformed HTML (if you have to process something weird). BTW, can you define a bit better what you mean by "clean" and which rules have to be applied? You can also use the original HTML Tidy (bit.ly/aahXs8) without relying on a wrapper if you only need to clean some files occasionally.Pragmaticism
I don't need to process the DOM, I just want to indent it. I specifically want a C# version because I need to use it in my C# project. I'm generating some HTML as a string, I want to take that string, have it indented, and output another string. No more, no less. Thought it would be easy to find a library to do that.Semiology
That codeproject looks nice, but it doesn't compile either. DLL linker errors.Semiology
Also, what DLL do I need to reference to access HtmlTextWriter? I can't find it anywhere in VS2010. System.Web.UI doesn't exist.Semiology
Your app probably targets the Client Profile? You have to switch it to the full framework and reference System.Web.dll.Pragmaticism
There is also the XhtmlTextWriter class (bit.ly/9VlCND), since you have to output XHTML.Pragmaticism
Ahh... good call with the client profile. I'm going to look at the HtmlWriters.Semiology
C
1

I've used SGML Reader to convert HTML to XHTML in the past. Might be worth looking into...

I never had any problems with it when I was using it.

Ceilometer answered 23/10, 2010 at 3:59 Comment(1)
A bit ridiculous to format some HTML, but it does work. Thanks :)Semiology
F
1

You can use HtmlAgilityPack (add this package from nuget).

Code sample:

string html = "<div><p>line 1<br>line 2</p><span></div>";
var htmlDoc = new HtmlAgilityPack.HtmlDocument();
htmlDoc.LoadHtml(html);
var fixedHtml = htmlDoc.DocumentNode.OuterHtml;

Output:

<div><p>line 1<br />line 2</p><span></span></div>
Farfetched answered 17/4, 2019 at 19:26 Comment(0)
T
0

The js-beautify library provides an HTML beautifier; I used html_beautify. For example:

const beautified = html_beautify("<div><p></p></div>");
console.log(beautified)
<script src="https://cdnjs.cloudflare.com/ajax/libs/js-beautify/1.14.0/beautify-html.min.js"></script>
Teniers answered 10/7, 2021 at 6:10 Comment(1)
Welcome to Stackoverflow. This question is about C#. Your answer seems to suggest a JS library. Please keep your answers on-topic.Semiology
