Whitelisting, preventing XSS with WMD control in C#
Are there any problems with what I am doing here? This is my first time to deal with something like this, and I just want to make sure I understand all the risks, etc. to different methods.

I am using WMD to get user input, and I am displaying it with a Literal control. Since the content is uneditable once entered, I will be storing the HTML and not the Markdown:

input = Server.HtmlEncode(stringThatComesFromWMDTextArea);

I then run something like the following for the tags I want users to be able to use:

// Unescape whitelisted tags.
string output = input.Replace("&lt;b&gt;", "<b>").Replace("&lt;/b&gt;", "</b>")
                     .Replace("&lt;i&gt;", "<i>").Replace("&lt;/i&gt;", "</i>");

Edit: Here is what I am doing currently:

 public static string EncodeAndWhitelist(string html)
 {
     // Encode everything, then selectively unescape whitelisted tags.
     string[] whiteList = { "b", "i", "strong", "img", "ul", "li" };
     string encodedHTML = HttpUtility.HtmlEncode(html);
     foreach (string wl in whiteList)
         encodedHTML = encodedHTML
             .Replace("&lt;" + wl + "&gt;", "<" + wl + ">")
             .Replace("&lt;/" + wl + "&gt;", "</" + wl + ">");
     return encodedHTML;
 }
  1. Will what I am doing here keep me protected from XSS?
  2. Are there any other considerations that should be made?
  3. Is there a good list of normal tags to whitelist?
Osteoporosis answered 20/1, 2010 at 20:2 Comment(1)
That code won't work for the "img" tag, since replacing "&lt;img&gt;" doesn't allow for the "src" attribute. – Cofferdam

If your requirements really are that basic that you can do such simple string replacements then yes, this is ‘safe’ against XSS. (However, it's still possible to submit non-well-formed content where <i> and <b> are mis-nested or unclosed, which could potentially mess up the page the content ends up inserted into.)

But this is rarely enough. For example, currently <a href="..."> or <img src="..." /> are not allowed. If you wanted to allow these or other markup with attribute values, you'd have a whole lot more work to do. You might then approach it with regex, but that gives you endless problems with accidental nesting and replacement of already-replaced content, since regular expressions can't reliably parse HTML.

To solve both problems, the usual approach is to use an [X][HT]ML parser on the input, then walk the DOM removing all but known-good elements and attributes, then finally re-serialise to [X]HTML. The result is then guaranteed well-formed and contains only safe content.
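As a sketch of that approach using the HTML Agility Pack (the element and attribute whitelists below are illustrative, not a recommendation, and the exact API may vary slightly between package versions):

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package

public static class HtmlSanitizer
{
    // Illustrative whitelists -- adjust to your own requirements.
    static readonly HashSet<string> AllowedElements = new HashSet<string>
        { "b", "i", "strong", "em", "ul", "ol", "li", "p", "a", "img" };
    static readonly HashSet<string> AllowedAttributes = new HashSet<string>
        { "href", "src", "alt", "title" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Clean(doc.DocumentNode);
        // Re-serialise: the output is well-formed by construction.
        return doc.DocumentNode.InnerHtml;
    }

    static void Clean(HtmlNode node)
    {
        // Iterate over a copy, since we mutate the child collection.
        foreach (var child in node.ChildNodes.ToList())
        {
            if (child.NodeType == HtmlNodeType.Element)
            {
                if (!AllowedElements.Contains(child.Name.ToLowerInvariant()))
                {
                    child.Remove(); // drop the element and everything inside it
                    continue;
                }
                foreach (var attr in child.Attributes.ToList())
                    if (!AllowedAttributes.Contains(attr.Name.ToLowerInvariant()))
                        attr.Remove();
                Clean(child);
            }
            else if (child.NodeType == HtmlNodeType.Comment)
            {
                child.Remove(); // comments can carry conditional payloads
            }
        }
    }
}
```

One design choice to make explicitly: this sketch drops a disallowed element together with its children; an alternative is to unwrap it and keep the child text, which is friendlier to users but needs care so you don't promote previously nested unsafe content.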

Soapbox answered 20/1, 2010 at 20:28 Comment(2)
So, assuming I wanted something more robust, what would you suggest for the parsers you mentioned? Could HTML Agility Pack handle it? Is there not something that does all this already? – Osteoporosis
Yes, HTML Agility Pack is a good choice. Once you've got the DOM parsed it's a relatively trivial exercise to write a recursive function that removes all but known-good elements/attributes from the DOM tree. Also if you allow href/src/etc., remember to check the URLs for known-good schemes like http/https, to avoid injection through javascript: URLs and the like. – Soapbox
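A minimal version of the scheme check the comment suggests might look like this (IsSafeUrl is a hypothetical helper name; it uses System.Uri and, for simplicity, rejects anything that is not an absolute http or https URL, including relative links):

```csharp
using System;

public static class UrlGuard
{
    // Accept only absolute http/https URLs; everything else
    // (javascript:, data:, vbscript:, relative paths) is rejected.
    public static bool IsSafeUrl(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;
        // Uri normalises the scheme to lowercase, so mixed-case
        // tricks like "JaVaScRiPt:" are caught by this comparison.
        return uri.Scheme == Uri.UriSchemeHttp
            || uri.Scheme == Uri.UriSchemeHttps;
    }
}
```

In the DOM-walking sanitizer, you would call this on every href/src attribute value and remove the attribute (or the whole element) when it returns false.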
