Whitelisting, preventing XSS with WMD control in C#
Are there any problems with what I am doing here? This is my first time to deal with something like this, and I just want to make sure I understand all the risks, etc. to different methods.

I am using WMD to get user input, and I am displaying it with a Literal control. Since the content is uneditable once entered, I will be storing the HTML and not the Markdown:

input = Server.HtmlEncode(stringThatComesFromWMDTextArea);

I then run something like the following for the tags I want users to be able to use:

// Unescape whitelisted tags.
string output = input.Replace("&lt;b&gt;", "<b>").Replace("&lt;/b&gt;", "</b>")
                     .Replace("&lt;i&gt;", "<i>").Replace("&lt;/i&gt;", "</i>");

Edit: Here is what I am doing currently:

 public static string EncodeAndWhitelist(string html)
 {
     // Encode everything, then selectively unescape whitelisted tags.
     string[] whiteList = { "b", "i", "strong", "img", "ul", "li" };
     string encodedHTML = HttpUtility.HtmlEncode(html);
     foreach (string wl in whiteList)
         encodedHTML = encodedHTML
             .Replace("&lt;" + wl + "&gt;", "<" + wl + ">")
             .Replace("&lt;/" + wl + "&gt;", "</" + wl + ">");
     return encodedHTML;
 }
  1. Will what I am doing here keep me protected from XSS?
  2. Are there any other considerations that should be made?
  3. Is there a good list of normal tags to whitelist?
Osteoporosis answered 20/1, 2010 at 20:2 Comment(1)
That code won't work for the "img" tag, since replacing "&lt;img&gt;" doesn't allow for the "src" attribute. – Cofferdam

If your requirements really are that basic that you can do such simple string replacements then yes, this is ‘safe’ against XSS. (However, it's still possible to submit non-well-formed content where <i> and <b> are mis-nested or unclosed, which could potentially mess up the page the content ends up inserted into.)

But this is rarely enough. For example, currently <a href="..."> or <img src="..." /> are not allowed. If you wanted to allow these or other markup with attribute values, you'd have a whole lot more work to do. You might then approach it with regex, but that gives you endless problems with accidental nesting and replacement of already-replaced content, since regular expressions can't reliably parse HTML.

To solve both problems, the usual approach is to use an [X][HT]ML parser on the input, then walk the DOM removing all but known-good elements and attributes, then finally re-serialise to [X]HTML. The result is then guaranteed well-formed and contains only safe content.
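As a sketch of that approach using the HTML Agility Pack (the element and attribute whitelists below are illustrative, not a recommendation, and the exact API may vary slightly between package versions):

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // NuGet package

public static class HtmlSanitizer
{
    // Illustrative whitelists -- adjust to your own requirements.
    static readonly HashSet<string> AllowedElements = new HashSet<string>
        { "b", "i", "strong", "em", "ul", "ol", "li", "p", "a", "img" };
    static readonly HashSet<string> AllowedAttributes = new HashSet<string>
        { "href", "src", "alt", "title" };

    public static string Sanitize(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Clean(doc.DocumentNode);
        // Re-serialise: the output is well-formed by construction.
        return doc.DocumentNode.InnerHtml;
    }

    static void Clean(HtmlNode node)
    {
        // Iterate over a copy, since we mutate the child collection.
        foreach (var child in node.ChildNodes.ToList())
        {
            if (child.NodeType == HtmlNodeType.Element)
            {
                if (!AllowedElements.Contains(child.Name.ToLowerInvariant()))
                {
                    child.Remove(); // drop the element and everything inside it
                    continue;
                }
                foreach (var attr in child.Attributes.ToList())
                    if (!AllowedAttributes.Contains(attr.Name.ToLowerInvariant()))
                        attr.Remove();
                Clean(child);
            }
            else if (child.NodeType == HtmlNodeType.Comment)
            {
                child.Remove(); // comments can carry conditional payloads
            }
        }
    }
}
```

One design choice to make explicitly: this sketch drops a disallowed element together with its children; an alternative is to unwrap it and keep the child text, which is friendlier to users but needs care so you don't promote previously nested unsafe content.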

Soapbox answered 20/1, 2010 at 20:28 Comment(2)
So, assuming I wanted something more robust, what would you suggest for the parsers you mentioned? Could HTML Agility Pack handle it? Is there not something that does all this already? – Osteoporosis
Yes, HTML Agility Pack is a good choice. Once you've got the DOM parsed it's a relatively trivial exercise to write a recursive function that removes all but known-good elements/attributes from the DOM tree. Also if you allow href/src/etc., remember to check the URLs for known-good schemes like http/https, to avoid injection through javascript: URLs and the like. – Soapbox
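A minimal version of the scheme check the comment suggests might look like this (IsSafeUrl is a hypothetical helper name; it uses System.Uri and, for simplicity, rejects anything that is not an absolute http or https URL, including relative links):

```csharp
using System;

public static class UrlGuard
{
    // Accept only absolute http/https URLs; everything else
    // (javascript:, data:, vbscript:, relative paths) is rejected.
    public static bool IsSafeUrl(string url)
    {
        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri))
            return false;
        // Uri normalises the scheme to lowercase, so mixed-case
        // tricks like "JaVaScRiPt:" are caught by this comparison.
        return uri.Scheme == Uri.UriSchemeHttp
            || uri.Scheme == Uri.UriSchemeHttps;
    }
}
```

In the DOM-walking sanitizer, you would call this on every href/src attribute value and remove the attribute (or the whole element) when it returns false.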
