Sanitize/Rewrite HTML on the Client Side
D

10

85

I need to display external resources loaded via cross domain requests and make sure to only display "safe" content.

I could use Prototype's String#stripScripts to remove script blocks, but handlers such as onclick or onerror would still be there.

Is there any library which can at least

  • strip script blocks,
  • kill DOM handlers,
  • remove blacklisted tags (e.g. embed or object).

Are there any JavaScript-related links and examples out there?

Dispensable answered 17/11, 2008 at 13:34 Comment(5)
Don't trust answers which might do this by regular expressions #1732848Commander
blog.codinghorror.com/parsing-html-the-cthulhu-wayKnotted
How is this safe? Can't users edit the javascript of a page?Judoka
yeah, it's not 'safe' unless you are simply trying to prevent mistakes by trusted users.Dorelle
I'm looking to strip anything that IS html, but leave things that are not. The problem is that "invalid HTML" is stripped, rather than left in. For example this text "e.g.R3C2<R6C2<R6C4" breaks in the sanitization methods I've found.Atlantis
M
113

Update 2016: There is now a Google Closure package based on the Caja sanitizer.

It has a cleaner API, was rewritten to take into account APIs available on modern browsers, and interacts better with Closure Compiler.


Shameless plug: see caja/plugin/html-sanitizer.js for a client side html sanitizer that has been thoroughly reviewed.

It is white-listed, not black-listed, but the whitelists are configurable as per CajaWhitelists


If you want to remove all tags, then do the following:

var tagBody = '(?:[^"\'>]|"[^"]*"|\'[^\']*\')*';

var tagOrComment = new RegExp(
    '<(?:'
    // Comment body.
    + '!--(?:(?:-*[^->])*--+|-?)'
    // Special "raw text" elements whose content should be elided.
    + '|script\\b' + tagBody + '>[\\s\\S]*?</script\\s*'
    + '|style\\b' + tagBody + '>[\\s\\S]*?</style\\s*'
    // Regular name
    + '|/?[a-z]'
    + tagBody
    + ')>',
    'gi');
function removeTags(html) {
  var oldHtml;
  do {
    oldHtml = html;
    html = html.replace(tagOrComment, '');
  } while (html !== oldHtml);
  return html.replace(/</g, '&lt;');
}
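
To see why removeTags keeps looping until the string stops changing, here is a minimal sketch. It uses a deliberately simplified tag pattern, and the names simpleTag, stripOnce and stripUntilStable are made up for this illustration; it is not the full pattern above.

```javascript
// Simplified stand-in pattern (an assumption for illustration only).
var simpleTag = /<\/?[a-z][^>]*>/gi;

// One pass: removing the inner <b> tags splices a new <script> tag together.
function stripOnce(html) {
  return html.replace(simpleTag, '');
}

// Repeat until a pass changes nothing, as removeTags does.
function stripUntilStable(html) {
  var previous;
  do {
    previous = html;
    html = html.replace(simpleTag, '');
  } while (html !== previous);
  return html;
}

var tricky = '<<b>script>alert(1)<<b>/script>';
console.log(stripOnce(tricky));        // <script>alert(1)</script>
console.log(stripUntilStable(tricky)); // alert(1)
```

A single replace can leave behind a tag that only exists after the first removal, which is why the loop runs to a fixed point.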

People will tell you that you can create an element, and assign innerHTML and then get the innerText or textContent, and then escape entities in that. Do not do that. It is vulnerable to XSS injection since <img src=bogus onerror=alert(1337)> will run the onerror handler even if the node is never attached to the DOM.
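
If the goal is only to show untrusted markup as literal text, one way to avoid the DOM entirely is plain string escaping. This is a minimal sketch under that assumption; escapeHtml is a hypothetical helper name, not something from Caja.

```javascript
// Hypothetical helper: escape text for insertion into HTML without ever
// assigning the untrusted string to innerHTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

console.log(escapeHtml('<img src=bogus onerror=alert(1337)>'));
// &lt;img src=bogus onerror=alert(1337)&gt;
```

Because the string is never parsed as HTML, the onerror handler never gets a chance to run.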

Maes answered 10/1, 2009 at 0:19 Comment(18)
Great, looks like there's a little documentation here: code.google.com/p/google-caja/wiki/JsHtmlSanitizerFidele
Is it possible to use the sanitizer on the client side without any HTML tag whitelisted whatsoever? Even when I modified the JSON files, it still acts as if the defaults are whitelisted...Marjorymarjy
@Almad, maybe I misunderstood your question. If you modify the white-lists, you need to regenerate html4-defs.js which is a JavaScript file generated from the JSON. That involves running ant.Maes
The Caja HTML sanitizer code looks great, but requires some glue code (neighbouring cssparser.js, but more importantly, the html4 object). Additionally, it pollutes the global window property. Is there a for-the-web version of this code? If not, do you see a better way to produce and maintain one than to create a separate project for it?Witwatersrand
@phihag, Ask at google-caja-discuss and they might point you at a packaged one. I believe the window object pollution is for backwards compatibility, and so any new package version might not need that.Maes
Turns out there already is a package for webbrowsers.Witwatersrand
@Witwatersrand That package is for nodejs, not browsers.Trabzon
@MikeSamuel I think while(html !== (html = html.replace(tagOrComment, ''))){} can be used instead of var oldHtml; do {oldHtml = html; html = html.replace(tagOrComment, ''); } while (html !== oldHtml);Cherin
What do you think of the idea of using a sandboxed iframe (with js disabled, of course) to parse HTML, then copying that DOM tree and leaving out insecure elements? See the code in my new answer to this question, for example.Numskull
@aldel, Doesn't allow-same-origin still allow scripts in attributes? What about meta redirects of parent frames? Maybe the display:none prevents leakage via image loads and CSS tricks, but not via stylesheet loads. Embedded iframes would still allow drive by downloads.Maes
@MikeSamuel, no, you would need allow-scripts to have any scripts run, including in attributes. allow-same-origin is required in order to have access to the DOM of the document in the iframe. I think the other issues you mentioned shouldn't be a problem, since the whole point of sandboxing is to allow display of untrusted HTML within your page. For example, meta redirects of parent pages (unless I misunderstand what you mean) wouldn't be possible without the allow-top-navigation permission.Numskull
I'm more worried about (1) my sandbox detection code giving a false positive; that is, it thinks the browser fully supports sandboxing when it doesn't; or (2) something is copied into the sanitized DOM tree that isn't actually safe, such as an image tag that exploits a vulnerability in a GIF decoder or something.Numskull
@MikeSamuel this script is the best I've seen. Is there a way to do this that will also strip "My Link" in the following example? <a href='#'>My Link</a>Outlast
@MikeSamuel I tried using the aforementioned img tag and escaping html using the create elm, assign innerHTML, and return its innerHTML to an element here: jsbin.com/butofezipi/edit?html,css,js,output I don't know if I did it wrong...but looks like the encoding method does work for that instance? Just trying to play devil's advocate. Looking for a good method myself :b Just wanted to point out that the method that you suggest does fit my needs so thanks!Mclemore
horrible code...this is what you get when you put C++/JAVA developers to write Javascript.Acton
Would it work to assign to innerText and then retrieve innerHTML?Analyzer
@ArlenBeiler, that would convert <b>foo</b> to &lt;b&gt;foo&lt;/b&gt;, right? That probably would not solve the OP's problem.Maes
Yes, true. Just thought I'd ask since that's what I thought, but you're right, that wouldn't really solve the original question.Analyzer
T
40

The Google Caja HTML sanitizer can be made "web-ready" by embedding it in a web worker. Any global variables introduced by the sanitizer will be contained within the worker, plus processing takes place in its own thread.

For browsers that do not support Web Workers, we can use an iframe as a separate environment for the sanitizer to work in. Timothy Chien has a polyfill that does just this, using iframes to simulate Web Workers, so that part is done for us.

The Caja project has a wiki page on how to use Caja as a standalone client-side sanitizer:

  • Checkout the source, then build by running ant
  • Include html-sanitizer-minified.js or html-css-sanitizer-minified.js in your page
  • Call html_sanitize(...)

The worker script only needs to follow those instructions:

importScripts('html-css-sanitizer-minified.js'); // or 'html-sanitizer-minified.js'

var urlTransformer, nameIdClassTransformer;

// customize if you need to filter URLs and/or ids/names/classes
urlTransformer = nameIdClassTransformer = function(s) { return s; };

// when we receive some HTML
self.onmessage = function(event) {
    // sanitize, then send the result back
    postMessage(html_sanitize(event.data, urlTransformer, nameIdClassTransformer));
};

(A bit more code is needed to get the simworker library working, but it's not important to this discussion.)
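
For the page side, something along these lines could talk to the worker. This is only a sketch: createSanitizerClient and the file name sanitizer-worker.js are assumptions, and it relies on the worker answering messages in the order they were sent.

```javascript
// Hypothetical page-side wrapper around the worker script shown above.
// `worker` is anything with postMessage/onmessage, e.g. a Web Worker.
function createSanitizerClient(worker) {
  var pending = []; // callbacks waiting for a reply, oldest first

  worker.onmessage = function (event) {
    var callback = pending.shift();
    if (callback) { callback(event.data); }
  };

  return function sanitize(html, callback) {
    pending.push(callback);
    worker.postMessage(html);
  };
}

// Browser usage (the file name is an assumption):
// var sanitize = createSanitizerClient(new Worker('sanitizer-worker.js'));
// sanitize(untrustedHtml, function (safeHtml) { container.innerHTML = safeHtml; });
```

Keeping the callbacks in a queue means several sanitize calls can be in flight without overwriting each other's handler.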

Demo: https://dl.dropbox.com/u/291406/html-sanitize/demo.html

Trabzon answered 5/7, 2012 at 11:31 Comment(11)
Great answer. Jeffrey, can you explain why the sanitization needs to be done by a web worker at all?Molokai
@AustinWang Web workers aren't strictly necessary, but since sanitization can potentially be computationally expensive and requires no user interaction, it is well suited for the task. (I also mentioned containing global variables in the main answer.)Trabzon
I cannot find decent documentation for this library. Where/how do I specify my whitelist of elements and attributes?Empty
@Empty As described by a comment in the current version, nameIdClassTransformer is called for every HTML name, element ID and list of classes; returning null will delete the attribute. By editing the JSON files in src/com/google/caja/lang/html you can also customize which elements and attributes are whitelisted.Trabzon
@JefferyTo I am sorry, maybe I am too dumb, but I don't get it. The JSON files you refer to are not used in your example and demo above. I want to use the library in a browser, so I looked at your demo. Can you modify the nameIdClassTranformer function above e.g. to reject all <script> tags and accept <b> and <i> tags?Empty
@Empty You need to check out the source code, edit the JSON files, then run ant to build the JS files appropriate for your use case.Trabzon
@Acton importScripts is part of the Web Workers API.Trabzon
Uncaught ReferenceError: importScripts is not defined on FF / ChromeActon
@Acton Did you try it inside of a Web Worker?Trabzon
ha, I thought that script itself was the script which was off-loaded to the web-worker within the browser's main thread. are you calling it from a blob?Acton
@Acton No, there is no need for blobs; I suggest taking a look at the Web Workers article aboveTrabzon
F
22

Never trust the client. If you're writing a server application, assume that the client will always submit unsanitized, malicious data. It's a rule of thumb that will keep you out of trouble. If you can, I would advise doing all validation and sanitization in server code, which you know (to a reasonable degree) won't be fiddled with. Perhaps you could use a server-side web application as a proxy for your client-side code, which fetches from the third party and sanitizes the content before sending it to the client?

[edit] I'm sorry, I misunderstood the question. However, I stand by my advice. Your users will probably be safer if you sanitize on the server before sending it to them.

Fisherman answered 10/1, 2009 at 0:53 Comment(1)
Actually, with the popularity of node.js rising, a javascript solution might also be a serverside solution. That's how I ended up here at least. Still, this is excellent advice to live by.Umbra
N
19

Now that all major browsers support sandboxed iframes, there is a much simpler way that I think can be secure. I'd love it if this answer could be reviewed by people who are more familiar with this kind of security issue.

NOTE: This method definitely will not work in IE 9 and earlier. See this table for browser versions that support sandboxing. (Note: the table seems to say it won't work in Opera Mini, but I just tried it, and it worked.)

The idea is to create a hidden iframe with JavaScript disabled, paste your untrusted HTML into it, and let it parse it. Then you can walk the DOM tree and copy out the tags and attributes that are considered safe.

The whitelists shown here are just examples. What's best to whitelist would depend on the application. If you need a more sophisticated policy than just whitelists of tags and attributes, that can be accommodated by this method, though not by this example code.

var tagWhitelist_ = {
  'A': true,
  'B': true,
  'BODY': true,
  'BR': true,
  'DIV': true,
  'EM': true,
  'HR': true,
  'I': true,
  'IMG': true,
  'P': true,
  'SPAN': true,
  'STRONG': true
};

var attributeWhitelist_ = {
  'href': true,
  'src': true
};

function sanitizeHtml(input) {
  var iframe = document.createElement('iframe');
  if (iframe['sandbox'] === undefined) {
    alert('Your browser does not support sandboxed iframes. Please upgrade to a modern browser.');
    return '';
  }
  iframe['sandbox'] = 'allow-same-origin';
  iframe.style.display = 'none';
  document.body.appendChild(iframe); // necessary so the iframe contains a document
  iframe.contentDocument.body.innerHTML = input;
  
  function makeSanitizedCopy(node) {
    if (node.nodeType == Node.TEXT_NODE) {
      var newNode = node.cloneNode(true);
    } else if (node.nodeType == Node.ELEMENT_NODE && tagWhitelist_[node.tagName]) {
      newNode = iframe.contentDocument.createElement(node.tagName);
      for (var i = 0; i < node.attributes.length; i++) {
        var attr = node.attributes[i];
        if (attributeWhitelist_[attr.name]) {
          newNode.setAttribute(attr.name, attr.value);
        }
      }
      for (i = 0; i < node.childNodes.length; i++) {
        var subCopy = makeSanitizedCopy(node.childNodes[i]);
        newNode.appendChild(subCopy, false);
      }
    } else {
      newNode = document.createDocumentFragment();
    }
    return newNode;
  };

  var resultElement = makeSanitizedCopy(iframe.contentDocument.body);
  document.body.removeChild(iframe);
  return resultElement.innerHTML;
};

SECURITY HOLE: Commenter @Explosion points out that an href attribute can contain JavaScript, like <a href="javascript:alert('Oops')">. It should be possible to catch that and remove it in the sanitization code, but the above code has not (yet) been updated to do that.
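
One possible way to close that hole is to check the scheme of every URL attribute against a short allowlist before copying it. The sketch below is not part of the tested code above: isSafeUrl and SAFE_PROTOCOLS are hypothetical names, and it assumes the URL constructor is available (it is not in IE 11).

```javascript
// Hypothetical helper: allow only URLs whose scheme is on a short allowlist.
var SAFE_PROTOCOLS = { 'http:': true, 'https:': true, 'mailto:': true };

function isSafeUrl(value) {
  try {
    // The base is a placeholder so that relative URLs can be parsed;
    // they then inherit its https: scheme and are accepted.
    var url = new URL(value, 'https://example.invalid/');
    return SAFE_PROTOCOLS[url.protocol] === true;
  } catch (e) {
    return false; // anything that cannot be parsed is rejected
  }
}

console.log(isSafeUrl("javascript:alert('Oops')")); // false
console.log(isSafeUrl('https://example.com/page')); // true

// Possible use inside makeSanitizedCopy, since both whitelisted attributes hold URLs:
// if (attributeWhitelist_[attr.name] && isSafeUrl(attr.value)) { ... }
```

Letting the URL parser decide the scheme avoids hand-written checks that miss mixed case or embedded whitespace, and an allowlist also rejects other schemes such as data:.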

You can try it out here.

Note that I'm disallowing style attributes and tags in this example. If you allowed them, you'd probably want to parse the CSS and make sure it's safe for your purposes.

I've tested this on several modern browsers (Chrome 40, Firefox 36 Beta, IE 11, Chrome for Android), and on one old one (IE 8) to make sure it bailed before executing any scripts. I'd be interested to know if there are any browsers that have trouble with it, or any edge cases that I'm overlooking.

Numskull answered 16/2, 2015 at 1:59 Comment(8)
This post deserves some attention from the experts, as it seems to be the obvious and simplest solution. Is it truly secure?Mcalpin
How can you programmatically create a hidden iframe "with JavaScript disabled"? To my best knowledge this is impossible. The minute you do iframe.contentDocument.body.innerHTML = input, whatever script tags in there will be executed.Empty
@Empty - look up the sandbox attribute on iframes.Numskull
@Numskull Indeed, I didn't know about it. For us it's still a no-go because of the lack of support in IE9. I guess your solution could work, but I think you should clarify in your response that you depend on the sandbox attribute.Empty
Sorry, I thought that was clear from my opening "Now that all major browsers support sandboxed iframes". I'll add a less subtle note.Numskull
@Acton It's working for me in Firefox 53, and I'm pretty sure it worked in whatever earlier version was out when I wrote the answer (44, I think). Do you have a plugin that's interfering, maybe?Numskull
@Numskull - oops sorry, my VPN is blocking the script file for some reason.Acton
@Mcalpin href attributes can contain javascript: <a href="javascript: alert('hello')">Test</a>Phthisic
C
14

So, it's 2016, and I think many of us are using npm modules in our code now. sanitize-html seems like the leading option on npm, though there are others.

Other answers to this question provide great input in how to roll your own, but this is a tricky enough problem that well-tested community solutions are probably the best answer.

Run this on the command line to install: npm install --save sanitize-html

ES5:

var sanitizeHtml = require('sanitize-html');
// ...
var sanitized = sanitizeHtml(htmlInput);

ES6:

import sanitizeHtml from 'sanitize-html';
// ...
let sanitized = sanitizeHtml(htmlInput);

Copyedit answered 23/8, 2016 at 17:0 Comment(3)
2018 here, this is too heavy (a half megabyte of dependencies)Gaudreau
2020 here, sanitize-html is for Node and there's still no good option for browsers as far as I can tellDougdougal
2021 here, and it seems that v2 of sanitize-html is now only ~80kB and works in browser.Canal
D
12

You can't anticipate every possible weird type of malformed markup that some browser somewhere might trip over to escape blacklisting, so don't blacklist. There are many more structures you might need to remove than just script/embed/object and handlers.

Instead attempt to parse the HTML into elements and attributes in a hierarchy, then run all element and attribute names against an as-minimal-as-possible whitelist. Also check any URL attributes you let through against a whitelist (remember there are more dangerous protocols than just javascript:).
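
As a rough illustration of the whitelist step, here is a sketch that assumes the HTML has already been parsed into a tree of plain objects. The node shape ({ tag, attrs, children }, with strings for text) and the names filterTree, ALLOWED_TAGS and ALLOWED_ATTRS are all hypothetical. URL-bearing attributes are deliberately left off the list because they would need a separate scheme check.

```javascript
// Hypothetical, deliberately small allowlists; tag names are assumed lowercase.
var ALLOWED_TAGS = { b: true, i: true, em: true, strong: true, p: true, br: true };
var ALLOWED_ATTRS = { title: true };

// Returns a filtered copy of the tree, or null if the node is not allowed.
function filterTree(node) {
  if (typeof node === 'string') { return node; }        // text nodes are kept
  if (ALLOWED_TAGS[node.tag] !== true) { return null; } // unknown element: drop it and its subtree

  var attrs = {};
  Object.keys(node.attrs || {}).forEach(function (name) {
    if (ALLOWED_ATTRS[name] === true) { attrs[name] = node.attrs[name]; }
  });

  var children = (node.children || [])
    .map(function (child) { return filterTree(child); })
    .filter(function (child) { return child !== null; });

  return { tag: node.tag, attrs: attrs, children: children };
}
```

Anything not explicitly named is dropped, so new or obscure elements and handlers are rejected by default rather than having to be anticipated.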

If the input is well-formed XHTML the first part of the above is much easier.

As always with HTML sanitisation, if you can find any other way to avoid doing it, do that instead. There are many, many potential holes. If the major webmail services are still finding exploits after this many years, what makes you think you can do better?

Diazomethane answered 17/11, 2008 at 15:26 Comment(0)
C
4

[Disclaimer: I'm one of the authors]

We wrote a "web-only" (i.e. "requires a browser") open source library for this, https://github.com/jitbit/HtmlSanitizer that removes all tags/attributes/styles except the "whitelisted" ones.

Usage:

var input = HtmlSanitizer.SanitizeHtml("<script> Alert('xss!'); </scr"+"ipt>");

P.S. Works much faster than a "pure JavaScript" solution since it uses the browser to parse and manipulate DOM. If you're interested in a "pure JS" solution please try https://github.com/punkave/sanitize-html (not affiliated)

Celestina answered 21/1, 2019 at 15:6 Comment(0)
E
2

The Google Caja library suggested above was way too complex to configure and include in my project for a web application (that is, running in the browser). What I resorted to instead, since we already use the CKEditor component, is its built-in HTML sanitizing and whitelisting function, which is far easier to configure. So, you can load a CKEditor instance in a hidden iframe and do something like:

CKEDITOR.instances['myCKEInstance'].dataProcessor.toHtml(myHTMLstring)

Now, granted, if you're not using CKEditor in your project this may be a bit of an overkill, since the component itself is around half a megabyte (minimized), but if you have the sources, maybe you can isolate the code doing the whitelisting (CKEDITOR.htmlParser?) and make it much shorter.

http://docs.ckeditor.com/#!/api

http://docs.ckeditor.com/#!/api/CKEDITOR.htmlDataProcessor

Empty answered 16/3, 2016 at 13:38 Comment(0)
P
1

Instead of using regex, I thought of a way using native DOM APIs. You parse the HTML into a document, walk every element, and remove any element or attribute that is not whitelisted. The attribute whitelist is an array whose entries are either simple strings naming attributes to allow, or objects that add a regex to validate the value and a list of tags on which the attribute is allowed.

const sanitize = (html, tags = undefined, attributes = undefined) => {
    var attributes = attributes || [
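      // Allow http(s) and protocol-relative URLs. The previous pattern required "//:" and so never matched "//host".
      { attribute: "src", tags: "*", regex: /^(?:https?:)?\/\//i },
      // Allowlist instead of blocking "javascript:" alone, which missed "JavaScript:", leading whitespace and "data:".
      { attribute: "href", tags: "*", regex: /^(?:(?:https?:)?\/\/|mailto:|[\/#])/i },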
      { attribute: "width", tags: "*", regex: /^[0-9]+$/ },
      { attribute: "height", tags: "*", regex: /^[0-9]+$/ },
      { attribute: "id", tags: "*", regex: /^[a-zA-Z]+$/ },
      { attribute: "class", tags: "*", regex: /^[a-zA-Z ]+$/ },
      { attribute: "value", tags: ["INPUT", "TEXTAREA"], regex: /^.+$/ },
      { attribute: "checked", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
      {
        attribute: "placeholder",
        tags: ["INPUT", "TEXTAREA"],
        regex: /^.+$/,
      },
      {
        attribute: "alt",
        tags: ["IMG", "AREA", "INPUT"],
        //"^" and "$" match beginning and end
        regex: /^[0-9a-zA-Z]+$/,
      },
      { attribute: "autofocus", tags: ["INPUT"], regex: /^(?:true|false)+$/ },
      { attribute: "for", tags: ["LABEL", "OUTPUT"], regex: /^[a-zA-Z0-9]+$/ },
    ]
    var tags = tags || [
      "I",
      "P",
      "B",
      "BODY",
      "HTML",
      "DEL",
      "INS",
      "STRONG",
      "SMALL",
      "A",
      "IMG",
      "CITE",
      "FIGCAPTION",
      "ASIDE",
      "ARTICLE",
      "SUMMARY",
      "DETAILS",
      "NAV",
      "TD",
      "TH",
      "TABLE",
      "THEAD",
      "TBODY",
      "NAV",
      "SPAN",
      "BR",
      "CODE",
      "PRE",
      "BLOCKQUOTE",
      "EM",
      "HR",
      "H1",
      "H2",
      "H3",
      "H4",
      "H5",
      "H6",
      "DIV",
      "MAIN",
      "HEADER",
      "FOOTER",
      "SELECT",
      "COL",
      "AREA",
      "ADDRESS",
      "ABBR",
      "BDI",
      "BDO",
    ]

    attributes = attributes.map((el) => {
      if (typeof el === "string") {
        return { attribute: el, tags: "*", regex: /^.+$/ }
      }
      let output = el
      if (!el.hasOwnProperty("tags")) {
        output.tags = "*"
      }
      if (!el.hasOwnProperty("regex")) {
        output.regex = /^.+$/
      }
      return output
    })
    var el = new DOMParser().parseFromString(html, "text/html")
    var elements = el.querySelectorAll("*")
    for (let i = 0; i < elements.length; i++) {
      const current = elements[i]
      let attr_list = get_attributes(current)
      for (let j = 0; j < attr_list.length; j++) {
        const attribute = attr_list[j]
        if (!attribute_matches(current, attribute)) {
          current.removeAttribute(attr_list[j])
        }
      }
      if (!tags.includes(current.tagName)) {
        current.remove()
      }
    }
    return el.documentElement.innerHTML
    function attribute_matches(element, attribute) {
      let output = attributes.filter((attr) => {
        let returnval =
          attr.attribute === attribute &&
          (attr.tags === "*" || attr.tags.includes(element.tagName)) &&
          attr.regex.test(element.getAttribute(attribute))
        return returnval
      })

      return output.length > 0
    }
    function get_attributes(element) {
      for (
        var i = 0, atts = element.attributes, n = atts.length, arr = [];
        i < n;
        i++
      ) {
        arr.push(atts[i].nodeName)
      }
      return arr
    }
  }
* {
  font-family: sans-serif;
}
textarea {
  width: 49%;
  height: 300px;
  padding: 10px;
  box-sizing: border-box;
  resize: none;
}
<h1>Sanitize HTML client side</h1>
<textarea id='input' placeholder="Unsanitized HTML">
&lt;!-- This removes both the src and onerror attributes because src is not a valid url. --&gt;
&lt;img src=&quot;error&quot; onerror=&quot;alert('XSS')&quot;&gt;
&lt;div id=&quot;something_harmless&quot; onload=&quot;alert('More XSS')&quot;&gt;
   &lt;b&gt;Bold text!&lt;/b&gt; and &lt;em&gt;Italic text!&lt;/em&gt;, some more text. &lt;del&gt;Deleted text!&lt;/del&gt;
&lt;/div&gt;
 &lt;script&gt;
    alert(&quot;This would be XSS&quot;);
  &lt;/script&gt;
</textarea>
<textarea id='output' placeholder="Sanitized HTML will appear here" readonly></textarea>
<script>
  document.querySelector("#input").onkeyup = () => {
    document.querySelector("#output").value = sanitize(document.querySelector("#input").value);
  }
</script>
Phthisic answered 3/2, 2021 at 13:34 Comment(0)
H
0

I recommend cutting frameworks out of your life; it would make things considerably easier for you in the long term.

cloneNode: Cloning a node copies all of its attributes and their values but does NOT copy event listeners.

https://developer.mozilla.org/en/DOM/Node.cloneNode

The following is not tested, though I have been using TreeWalkers for some time now and they are one of the most undervalued parts of JavaScript. Here is a list of the node types you can crawl; I usually use SHOW_ELEMENT or SHOW_TEXT.

http://www.w3.org/TR/DOM-Level-2-Traversal-Range/traversal.html#Traversal-NodeFilter

function xhtml_cleaner(id)
{
 var e = document.getElementById(id);
 var f = document.createDocumentFragment();
 f.appendChild(e.cloneNode(true));

 var walker = document.createTreeWalker(f,NodeFilter.SHOW_ELEMENT,null,false);

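 // Collect script elements first: removing a node during the walk detaches
 // walker.currentNode and ends the traversal early.
 var scripts = [];

 while (walker.nextNode())
 {
  var c = walker.currentNode;
  if (c.hasAttribute('contentEditable')) {c.removeAttribute('contentEditable');}
  if (c.hasAttribute('style')) {c.removeAttribute('style');}

  if (c.nodeName.toLowerCase()=='script') {scripts.push(c);}
 }

 for (var i = 0; i < scripts.length; i++) {element_del(scripts[i]);}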

 alert(new XMLSerializer().serializeToString(f));
 return f;
}


function element_del(element_id)
{
 if (document.getElementById(element_id))
 {
  document.getElementById(element_id).parentNode.removeChild(document.getElementById(element_id));
 }
 else if (element_id)
 {
  element_id.parentNode.removeChild(element_id);
 }
 else
 {
  alert('Error: the object or element \'' + element_id + '\' was not found and therefore could not be deleted.');
 }
}
Hoyden answered 4/7, 2012 at 5:58 Comment(2)
This code assumes that the input to clean has already been parsed and even inserted into the document tree. If that's the case, the malicious scripts have already been executed. The input should be a string.Witwatersrand
Then send a DOM fragment to it, just because it's in the DOM in a given shape or form does not actually imply that it has been executed. Presuming he's loading it via AJAX he can use this in conjunction with importNode.Hoyden
