Simple HTML sanitizer in JavaScript [closed]

3

49

I'm looking for a simple HTML sanitizer written in JavaScript. It doesn't need to be 100% XSS secure.

I'm implementing Markdown and the WMD Markdown editor (the SO master branch from GitHub) on my website. The problem is that the HTML shown in the live preview isn't filtered, like it is here on SO. I am looking for a simple/quick HTML sanitizer written in JavaScript so that I can filter the contents of the preview window.

No need for a full parser with complete XSS protection. I'm not sending the output back to the server. I'm sending the Markdown to the server where I use a proper, full HTML sanitizer before I store the result in the database.

Google is being absolutely useless to me. I just get hundreds of (often incorrect) articles on how to filter out javascript from user generated HTML in all kinds of server-side languages.

UPDATE

I'll explain a bit better why I need this. My website has an editor very similar to the one here on StackOverflow. There's a text area to enter Markdown syntax and a preview window below it that shows you what it will look like after you submit it.

When the user submits something, it is sent to the server in Markdown format. The server converts it to HTML and then runs an HTML sanitizer on it to clean up the HTML. Markdown allows arbitrary HTML so I need to clean it up. For example, the user types something like this:

<script>alert('Boo!');</script>

The Markdown converter does not touch it since it's HTML. The HTML sanitizer will strip it so the script element is gone.

But this is not what happens in the preview window. The preview window only converts Markdown to HTML but does not sanitize it. So, the preview window will have a script element. This means the preview window is different from the actual rendering on the server.

I want to fix this, so I need a quick-and-dirty JavaScript HTML sanitizer. Something simple with basic element/attribute blacklisting and whitelisting will do. It does not need to be XSS safe because XSS protection is done by the server-side HTML sanitizer.

This is just to make sure the preview window will match the actual rendering 99.99% of the time, which is good enough for me.

Can you help? Thanks in advance!

Mescaline answered 28/10, 2009 at 13:35 Comment(4)
FWIW, I hate it when the preview doesn't match what is published. - Hieroglyphic
@ms2ger: That's why I need the HTML sanitizer, so that the preview will match what the server does on the back-end. - Mescaline
Isn't it a problem to allow would-be attackers to test their attacks in their browser while you don't see any of their attempts? - Helaine
See Sanitizer API specification on GitHub. - Reefer
22

We've developed a simple HtmlSanitizer and open-sourced it here: https://github.com/jitbit/HtmlSanitizer

Usage

var result = HtmlSanitizer.SanitizeHtml(input);

[Disclaimer! I'm one of the authors!]
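To illustrate the general allowlist idea the question asks for, here is a quick-and-dirty sketch. It is not the HtmlSanitizer implementation and it is not XSS-safe (regexes cannot parse HTML reliably, e.g. a `>` inside an attribute value will confuse it). The names `ALLOWED_TAGS` and `previewSanitize` and the tag list are made up for this example. It is only meant for a preview whose real sanitizing happens on the server.

```javascript
// Hypothetical preview-only filter: NOT the HtmlSanitizer library and NOT XSS-safe.
// The allowlist below is an assumption; adjust it to match your server-side sanitizer.
const ALLOWED_TAGS = new Set([
  'b', 'i', 'em', 'strong', 'p', 'br', 'ul', 'ol', 'li', 'code', 'pre', 'blockquote'
]);

function previewSanitize(html) {
  return String(html)
    // Drop <script> and <style> elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, '')
    // Keep allowlisted tags with every attribute stripped; remove all other tags.
    .replace(/<(\/?)([a-z][a-z0-9]*)\b[^>]*>/gi, (match, slash, name) => {
      const tag = name.toLowerCase();
      return ALLOWED_TAGS.has(tag) ? `<${slash}${tag}>` : '';
    });
}
```

For example, `previewSanitize('<b onclick="x()">hi</b><img src=x onerror=alert(1)>')` keeps the `b` element without its attribute and drops the `img` tag. Dropping all attributes keeps the sketch short; a real sanitizer would allowlist attributes per tag and work on a parsed DOM instead of regexes.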

Sorb answered 18/1, 2019 at 14:37 Comment(6)
Thanks, but this approach still runs scripts like img onerror. There is also confusion over whether this code is GNU or MIT licensed (code header vs. license file). - Giacinta
@Giacinta Fixed the license confusion, thanks. If the onerror attribute is not whitelisted, it is removed from the HTML. What do you mean? (In any case, I invite you to open an issue on GitHub so we can properly fix it.) - Sorb
Sorry, it actually works as expected. I tried cleaning <img src=x onerror=alert(1) > and the onerror is correctly removed. - Giacinta
var html = '<TR><TD>Hello</TD></TR>'; console.log(HtmlSanitizer.SanitizeHtml(html)); The script also removes tr and td. - Gaige
After browsing for a good solution for hours, this is exactly what I needed. Came in late to the party to thank you @Alex. - Rosinarosinante
Thank you for this beautiful work :) I was desperately looking for something like this. - Balfore
11

Another hint: as of May 2021 there is an upcoming Sanitizer API in Firefox.

const inputString = 'Some text <b><i>with</i></b> <blink>tags</blink>, including a rogue script <script>alert(1)</script> def.';
const result = new Sanitizer().sanitizeToString(inputString);
console.log(result);
// Logs "Some text <b><i>with</i></b>, including a rogue script def."

(MDN example)

See: https://developer.mozilla.org/en-US/docs/Web/API/HTML_Sanitizer_API

If this feature is accepted by other vendors as well, it might help us get rid of JS-sanitizer-implementations.
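Since the API is not available everywhere, one cautious way to use it is behind feature detection. This is only a sketch: `sanitizeWithFallback` is a hypothetical helper name, `fallback` is assumed to be whatever string-to-string sanitizer you already have, and the check targets the `Sanitizer().sanitizeToString()` shape from the example above, which a given browser may or may not expose.

```javascript
// Hypothetical helper: use the native Sanitizer API if this exact shape exists,
// otherwise defer to a sanitizer function supplied by the caller.
function sanitizeWithFallback(input, fallback) {
  const hasNativeSanitizer =
    typeof globalThis.Sanitizer === 'function' &&
    typeof globalThis.Sanitizer.prototype.sanitizeToString === 'function';

  return hasNativeSanitizer
    ? new globalThis.Sanitizer().sanitizeToString(input)
    : fallback(input);
}
```

Where the native API is missing, the call simply returns `fallback(input)`, so the same call site keeps working as browser support changes.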

Endothermic answered 9/5, 2021 at 13:13 Comment(4)
It is now officially available in Chrome 103. See the Chrome feature specification. - Reefer
For Firefox, you can see the tracking issue/bug here. - Reefer
Sorry. Seems it will be available in Chrome 105 (feature specification, the issue). - Reefer
We're on version 120 and it's still not here yet... - Exigent
7

Here is a 2kb Vue component (it depends on Snarkdown, a 1kb Markdown renderer; replace it with whatever you need) that renders escaped Markdown, optionally translating B and I tags first for content that already uses those tags for formatting...

<template>
  <div v-html="html">
  </div>
</template>

<script>
import Snarkdown from 'snarkdown'
export default {
  props: ['code', 'bandi'],
  computed: {
    html () {
      // Convert b & i tags if flagged...
      const unsafe = this.bandi ? this.code
        .replace(/<b>/g, '**')
        .replace(/<\/b>/g, '**')
        .replace(/<i>/g, '*')
        .replace(/<\/i>/g, '*') : this.code

      // Process the markdown after we escape the html tags...
      return Snarkdown(unsafe
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;')
        .replace(/'/g, '&#039;')
      )
    }
  }
}
</script>

As a comparison, vue-markdown is over 100kb. This won't render math formulas and such, but 99.99% of people won't use it for those things, so not sure why the most popular markdown components are so bloated :(

Because all raw HTML is escaped before rendering, injected tags are displayed as text rather than executed, and it is super fast.

Why did I use &#039; and not &apos;? See: "Why shouldn't `&apos;` be used to escape single quotes?"
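The escaping step from the component can also stand on its own. A minimal sketch, assuming you just want the same five replacements as a reusable function (`escapeHtml` and `HTML_ESCAPES` are names made up for this example):

```javascript
// Hypothetical helper mirroring the component's replace chain.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#039;'
};

function escapeHtml(text) {
  // One pass over the string, so ampersands produced by a replacement
  // are never escaped a second time.
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}
```

With the question's example, `escapeHtml("<script>alert('Boo!');</script>")` yields text that the browser displays literally instead of running.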

And now for something completely different, but related...

Not sure why this hasn't been mentioned yet... but your browser can strip HTML for you.

Here is a 3-line HTML stripper that uses the browser's native HTML parser instead of a parser written in JavaScript. Note this does NOT escape HTML, it removes it, and it does not sanitize in the keep-the-safe-tags sense. It parses with DOMParser because assigning untrusted markup to innerHTML, even on a detached element, can still fire handlers such as <img onerror>.

const doc = new DOMParser().parseFromString(YourXSSAttackHere, 'text/html')
// textContent returns only the text, with every tag removed
const sanitized = doc.body.textContent

The closely related innerHTML/textContent pattern is used by the entity decoder in Vue.js, where it decodes entities in template text: https://github.com/vuejs/vue/blob/dev/src/compiler/parser/entity-decoder.js

Sunderland answered 29/4, 2020 at 23:45 Comment(6)
The question is about html SANITIZING, not REMOVING. - Sorb
Uh... Not only do I talk about sanitizing, I actually provided an example of the exact use case mentioned... sanitizing for markdown, and I did it in < 2kb; way smaller than anyone else, very cleanly, and it prevents XSS. The answer literally covers the entire question. I think it's very easy to argue my answer is the only complete answer provided in over a decade. You are down voting me because I also mention removing? Is that helpful to the community? :/ Would you down vote your mechanic if you take your car in for an oil change and they mention you have a gasket leak? Harsh bro. - Sunderland
I didn't downvote your answer. But I'm pointing out that "sanitizing" means "remove unwanted tags/attributes only, keeping everything else intact" which is quite different from "encoding" or "cleaning out". The textContent property in your example removes all HTML markup and returns text only. - Sorb
But I have provided 2 examples. I will put the sanitize one first. Maybe this will fix your concern? - Sunderland
This is escaping, not sanitizing. en.wikipedia.org/wiki/HTML_sanitization Even Snarkdown homepage says "Snarkdown does not sanitize html". - Sorb
Ah! Thank you for clarifying and for explaining. I appreciate it a lot. - Sunderland
