Asked 17/7, 2013 at 5:26 Answered 8/7, 2014 at 22:12

I want to be able to accept HTML from untrusted users and sanitize it so that I can safely include it in pages on my website. By this I mean that markup should not be stripped or escaped, but should be passed through essentially unchanged unless it contains dangerous tags such as <script> or <iframe>, dangerous attributes such as onload, or dangerous CSS properties such as background URLs. (Apparently some older IEs will execute javascript URLs in CSS?)

Serving the content from a different domain, enclosed in an iframe, is not a good option because there is no way to tell in advance how tall the iframe has to be so it will always look ugly for some pages.

I looked into HTML Purifier, but it looks like it doesn't support HTML5 yet. I also looked into Google Caja, but I'm looking for a solution that doesn't use scripts.

Does anyone know of a library that will accomplish this? PHP is preferred, but beggars can't be choosers.

Erato answered 17/7, 2013 at 5:26 Comment(2)

You may try raxan data sanitizer – Banksia 2/7, 2014 at 13:57

Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it. – Schreibe 2/7, 2014 at 14:46

The blacklisting approach puts you under upgrade pressure: each time browsers start to support new standards, you must bring your sanitizing tool up to the same level. Such changes happen more often than you think.

Whitelisting (which is achieved by strip_tags with well-defined exceptions) of course narrows the options for your users, but it puts you on the safe side.

On my own sites, my policy is to apply blacklisting on pages for very trusted users (such as admins) and whitelisting on all other pages. That means I do not have to put much effort into the blacklists. With more mature role and permission concepts you can even fine-tune your blacklists and whitelists.

UPDATE: I guess you look for this:

I take the point that strip_tags whitelists at the tag level but accepts everything at the attribute level. Interestingly, HTML Purifier seems to whitelist at the attribute level as well. Thanks, I learned something here.
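
To make the difference between tag-level and attribute-level whitelisting concrete, here is a minimal Python sketch built on the standard library's html.parser. The ALLOWED policy, the AllowlistSanitizer class and the sanitize function are hypothetical names chosen for illustration; this is not how strip_tags or HTML Purifier work internally.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy for illustration: tag -> attributes allowed on it.
ALLOWED = {
    "p": set(),
    "b": set(),
    "i": set(),
    "a": {"href", "title"},
}


class AllowlistSanitizer(HTMLParser):
    """Re-emits only allowlisted tags and attributes; escapes all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the tag itself; its text content is still emitted
        kept = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append("<{}{}>".format(tag, kept))

    def handle_endtag(self, tag):
        if tag in ALLOWED:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def sanitize(html_text):
    parser = AllowlistSanitizer()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)


print(sanitize('<p onclick="evil()">hi <b>there</b></p><script>alert(1)</script>'))
```

The sketch only decides which attribute names survive. It does not inspect attribute values, so an allowed href would still need its own URL check.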

Madlynmadman answered 4/7, 2014 at 12:40 Comment(1)

strip_tags can't protect from dangerous attributes. As long as the tag is allowed, it doesn't touch the attributes at all. – Erato 4/7, 2014 at 12:53

You might be able to do something along the lines of:

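// \b instead of \s+ so that tags without attributes (<script>) match too;
// the s flag lets . span newlines, and .*? stops at the first closing tag.
$html = preg_replace('/<\s*iframe\b[^>]*>.*?<\s*\/\s*iframe\s*>/is', '', $html);
$html = preg_replace('/<\s*script\b[^>]*>.*?<\s*\/\s*script\s*>/is', '', $html);
$html = preg_replace('/\s+onload\s*=\s*"[^"]*"/i', '', $html);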

... but then again: you have RegExes, now you have two problems - this might remove more than wanted and leave more than wanted as well.
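
To illustrate the "leave more than wanted" part, here is a small Python sketch. SCRIPT_RE, ONLOAD_RE and regex_clean are hypothetical names, and the patterns are only written in the spirit of the snippets above. Two inputs that such a blacklist fails on:

```python
import re

# Hypothetical blacklist patterns, similar in spirit to the snippets above.
SCRIPT_RE = re.compile(r"<\s*script\b[^>]*>.*?<\s*/\s*script\s*>", re.I | re.S)
ONLOAD_RE = re.compile(r'\s+onload\s*=\s*"[^"]*"', re.I)


def regex_clean(html_text):
    """Applies both patterns once, the way a simple blacklist filter would."""
    html_text = SCRIPT_RE.sub("", html_text)
    return ONLOAD_RE.sub("", html_text)


# Removing the inner tags splices the outer fragments into a new <script>.
nested = "<scr<script></script>ipt>alert(1)</scr<script></script>ipt>"
# A handler the patterns never look for, written without quotes.
other_handler = "<img src=x onerror=alert(1)>"

print(regex_clean(nested))
print(regex_clean(other_handler))
```

The first input comes out as a working script element and the second passes through untouched, which is why a parser-based whitelist is usually the safer choice.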

But since HTML Purifier is probably the most modern and well suited (and open source) project you should still use that one and maybe make adjustments if you really need them.

You can check out one of the following as well:

kses - the de facto standard; it found its way into WordPress as well
htmLawed - a further development of kses
PHP Input Filter - can filter tags and attributes

You also have to make sure that unclosed tags in the result do not break your own page layout when you include it.
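
One way to guard against that is to append closing tags for whatever is still open at the end of the fragment. The following is a rough Python sketch using the standard library's html.parser; OpenTagTracker, close_open_tags and the short VOID_TAGS list are assumptions made for illustration, and the sketch does not repair tags that are mis-nested in the middle of the fragment.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (shortened list for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class OpenTagTracker(HTMLParser):
    """Keeps a stack of the elements that are currently open."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def close_open_tags(fragment):
    """Returns the fragment with closing tags appended for open elements."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    closers = "".join("</{}>".format(tag) for tag in reversed(tracker.stack))
    return fragment + closers


print(close_open_tags("<div><b>bold"))
```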

Burtis answered 2/7, 2014 at 13:32 Comment(0)

Maybe it's better to take a different approach? How about telling them what they can use?

In that case you can use strip_tags. It will be easier and a lot more controllable this way, and very easy to extend in the future as well.

Nottingham answered 2/7, 2014 at 13:37 Comment(1)

> This function does not modify any attributes on the tags that you allow using allowable_tags, including the style and onmouseover attributes that a mischievous user may abuse when posting text that will be shown to other users. – Erato 2/7, 2014 at 14:30

In Ruby I use Nokogiri (there is a PHP version) to parse HTML content. You can parse the user's data, remove unnecessary tags or attributes, and then convert it back to text.

phpQuery - another parser.

And in PHP there is a strip_tags function.

Or you can manually remove attributes:

$dom = new DOMDocument;
$dom -> loadHTML( $html );
$xpath = new DOMXPath( $dom );
$nodes = $xpath -> query( "//*[@style]" ); // all elements with style attribute
foreach ( $nodes as $node ) {
    // remove or do what you want
    $node -> removeAttribute( "style" );
}
echo $dom -> saveHTML();
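
The same DOM-walking idea can be extended from style to event-handler attributes. Below is a hedged Python sketch using xml.etree.ElementTree from the standard library. It assumes the input is already a well-formed XHTML fragment with a single root element (ElementTree is an XML parser, not an HTML one), and strip_risky_attributes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def strip_risky_attributes(xhtml_fragment):
    """Removes style and on* attributes from a well-formed fragment."""
    root = ET.fromstring(xhtml_fragment)
    for element in root.iter():
        # Copy the names first, because the attribute dict is modified below.
        for name in list(element.attrib):
            if name.lower() == "style" or name.lower().startswith("on"):
                del element.attrib[name]
    return ET.tostring(root, encoding="unicode", method="html")


print(strip_risky_attributes(
    '<div style="color:red"><p onclick="x()" title="t">hi</p></div>'
))
```

Real user input is rarely well-formed XML, so in practice an HTML5-aware parser would have to produce the tree first.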

Obscurity answered 7/7, 2014 at 11:31 Comment(1)

@Brian It works, but not well. Better to use github.com/html5lib/html5lib-php – Obscurity 9/7, 2014 at 5:12

See the WdHTMLParser class. I use it for my forum.

Sample with WdHTMLParser:

This class parses the HTML into an array:

<div>
    <span>
        <br />
        <span>
        un bout de texte
        </span>
        <input type="text" />
    </span>
</div>

Array :

Array (
 [0] => Array (
  [name] => div
  [args] => Array ()
  [children] => Array (
   [0] => Array (
    [name] => span
    [args] => Array ()
    [children] => Array (
     [0] => Array (
      [name] => br
      [args] => Array ()
     )
     [1] => Array (
      [name] => span
      [args] => Array ()
      [children] => Array (
       [0] => un bout de texte
      )
     )
     [2] => Array (
      [name] => input
      [args] => Array (
       [type] => text
      )
     )
    )
   )
  )
 )
)

WdHTMLParser array to HTML

I use this class on my website to convert array to HTML.

voyageWdHTML_allowattr : Only these attributes are allowed.
voyageWdHTML_allowtag : Only these tags are allowed.
voyageWdHTML_special : Make your own rules. Currently, I add target="_blank" to each link and replace <br> with a newline (\n) inside pre tags.
fix_javascript : You can enable or disable this function, but it is not really needed.
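
The tree walk described above can be sketched in a few lines of Python. This is not a port of the Parser class: render, ALLOWED_TAGS and ALLOWED_ATTRS are hypothetical names, the allowlists are shortened, the target="_blank" rule is left out, and text nodes are escaped, which is an assumption that text is stored unescaped in the tree.

```python
from html import escape

# Hypothetical allowlists, much shorter than the ones in the Parser class.
ALLOWED_TAGS = {"div", "span", "a", "br", "pre"}
ALLOWED_ATTRS = {"href", "title", "target"}


def render(nodes):
    """Rebuilds HTML from a list of text strings and element dicts."""
    parts = []
    for node in nodes:
        if isinstance(node, str):
            parts.append(escape(node, quote=False))
            continue
        name = node["name"]
        children = node.get("children")
        if name not in ALLOWED_TAGS:
            # Disallowed elements are shown as inert preformatted text.
            name, children = "pre", children or []
        attrs = "".join(
            ' {}="{}"'.format(key, escape(value, quote=True))
            for key, value in node.get("args", {}).items()
            if key in ALLOWED_ATTRS
        )
        if children is None:
            parts.append("<{}{}>".format(name, attrs))
        else:
            parts.append("<{}{}>{}</{}>".format(name, attrs, render(children), name))
    return "".join(parts)


tree = [
    {"name": "div", "args": {}, "children": [
        {"name": "a", "args": {"onclick": "alert(1)"}, "children": ["Check javascript"]},
        {"name": "script", "args": {}, "children": ['alert("lol");']},
    ]},
]
print(render(tree))
```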

Sample php :

<?php
include "WdHTMLParser.php";
include "parser.php";

list($erreur, $message) = (new Parser())->parseBadHTML("<div>
    <span>
        <a onclick=\"alert('Hacked ! :'(');\">Check javascript</a>
        <script>alert(\"lol\");</script>
    </span>
</div>");

if ($erreur) {
    die("Error : ".$message);
}

echo $message;

Output :

<div>
    <span>
        <a target="_blank">Check javascript</a>
        <pre>alert("lol");</pre>
    </span>
</div>

My Parser class :

<?php
class Parser {
    //private function fix_javascript(&$message) { }

    private function voyageWdHTML_args($tab_args, $objname) {
        $html = "";
        foreach ($tab_args as $attr => $valeur) {
            if ($valeur !== null && $this->voyageWdHTML_allowattr($attr)) {
                $html .= " $attr=\"".htmlentities($valeur)."\"";
            }
        }
        return $html;
    }

    private function voyageWdHTML_allowattr($attr) {
        return in_array($attr, array("align", "face", "size", "href", "title", "target", "src", "color", "style",
                                    "data-class", "data-format"));
    }

    private function voyageWdHTML_allowtag($name) {
        return in_array($name, array("br", "b", "i", "u", "strike", "sub", "sup", "div", "ol", "ul", "li", "font", "span", "code",
                                    "hr", "blockquote", "cite", "a", "img", "p", "pre", "h6", "h5", "h4", "h3", "h2", "h1"));
    }

    private function voyageWdHTML_special(&$obj) {
        if ($obj["name"] == "a") { $obj["args"]["target"] = "_blank"; }
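        if ($obj["name"] == "pre") {
            // Keep text nodes, turn <br> into a newline, and drop any other element.
            // (array_filter returns a new array, so its result must be stored.)
            $children = array();
            foreach ($obj["children"] as $var) {
                if (is_string($var)) { $children[] = $var; }
                elseif ($var["name"] == "br") { $children[] = "\n"; }
            }
            $obj["children"] = $children;
        }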
    }

    private function voyageWdHTML($tableau, $lvl = 0) {
        $html = "";
        foreach ($tableau as $obj) {
            if (is_array($obj)) {
                if (!$this->voyageWdHTML_allowtag($obj["name"])) {
                    $obj["name"] = "pre";
                    if (!isset($obj["children"])) {
                        $obj["children"] = array();
                    }
                }
                if (isset($obj["children"])) {
                    $this->voyageWdHTML_special($obj);
                    $html .= "<{$obj["name"]}{$this->voyageWdHTML_args($obj["args"], $obj["name"])}>{$this->voyageWdHTML($obj["children"], $lvl+1)}</{$obj["name"]}>";
                } else {
                    $html .= "<{$obj["name"]}>";
                }
            } else {
                $html .= $obj;
            }
        }
        return $html;
    }

    public function parseBadHTML($message) {
        $WdHTMLParser = new WdHTMLParser();
        $message = str_replace(array("<br>", "<hr>"), array("<br/>", "<hr/>"), $message);
        $tableau = $WdHTMLParser->parse($message);

        if ($WdHTMLParser->malformed) {
            $retour = $WdHTMLParser->error;
        } else {
            $retour = $this->voyageWdHTML($tableau);

            //$this->fix_javascript($retour); // To make sure
        }

        return array($WdHTMLParser->malformed, $retour);
    }
}

WdHTMLParser class

<?php
class WdHTMLParser {
    private $encoding;
    private $matches;
    private $escaped;
    private $opened = array();
    public $malformed;
    public function parse($html, $namespace = NULL, $encoding = 'utf-8') {
        $this->malformed = false;
        $this->encoding  = $encoding;
        $html            = $this->escapeSpecials($html);
        $this->matches   = preg_split('#<(/?)' . $namespace . '([^>]*)>#', $html, -1, PREG_SPLIT_DELIM_CAPTURE);
        $tree            = $this->buildTree();
        if ($this->escaped) {
            $tree = $this->unescapeSpecials($tree);
        }
        return $tree;
    }
    private function escapeSpecials($html) {
        $html = preg_replace_callback('#<\!--.+-->#sU', array($this, 'escapeSpecials_callback'), $html);
        $html = preg_replace_callback('#<\?.+\?>#sU', array($this, 'escapeSpecials_callback'), $html);
        return $html;
    }
    private function escapeSpecials_callback($m) {
        $this->escaped = true;
        $text          = $m[0];
        $text          = str_replace(array('<', '>'), array("\x01", "\x02"), $text);
        return $text;
    }
    private function unescapeSpecials($tree) {
        return is_array($tree) ? array_map(array($this, 'unescapeSpecials'), $tree) : str_replace(array("\x01", "\x02"), array('<', '>'), $tree);
    }
    private function buildTree() {
        $nodes = array();
        $i     = 0;
        $text  = NULL;
        while (($value = array_shift($this->matches)) !== NULL) {
            switch ($i++ % 3) {
                case 0: {
                    if (trim($value)) {
                        $nodes[] = $value;
                    }
                }
                    break;
                case 1: {
                    $closing = ($value == '/');
                }
                    break;
                case 2: {
                    if (substr($value, -1, 1) == '/') {
                        $nodes[] = $this->parseMarkup(substr($value, 0, -1));
                    } else if ($closing) {
                        $open = array_pop($this->opened);
                        if ($value != $open) {
                            $this->error($value, $open);
                        }
                        return $nodes;
                    } else {
                        $node             = $this->parseMarkup($value);
                        $this->opened[]   = $node['name'];
                        $node['children'] = $this->buildTree($this->matches);
                        $nodes[]          = $node;
                    }
                }
            }
        }
        return $nodes;
    }
    public function parseMarkup($markup) {
        preg_match('#^[^\s]+#', $markup, $matches);
        $name = $matches[0];
        preg_match_all('#\s+([^=]+)\s*=\s*"([^"]+)"#', $markup, $matches, PREG_SET_ORDER);
        $args = array();
        foreach ($matches as $m) {
            $args[$m[1]] = html_entity_decode($m[2], ENT_QUOTES, $this->encoding);
        }
        return array('name' => $name, 'args' => $args);
    }
    public function error($markup, $expected) {
        $this->malformed = true;
        printf('unexpected closing markup "%s", should be "%s"', $markup, $expected);
    }
}

To be sure, you can also use this function (from mybb.com):

<?php
class Parser {
    private function fix_javascript(&$message) {
        $js_array = array(
            "#(&\#(0*)106;?|&\#(0*)74;?|&\#x(0*)4a;?|&\#x(0*)6a;?|j)((&\#(0*)97;?|&\#(0*)65;?|a)(&\#(0*)118;?|&\#(0*)86;?|v)(&\#(0*)97;?|&\#(0*)65;?|a)(\s)?(&\#(0*)115;?|&\#(0*)83;?|s)(&\#(0*)99;?|&\#(0*)67;?|c)(&\#(0*)114;?|&\#(0*)82;?|r)(&\#(0*)105;?|&\#(0*)73;?|i)(&\#112;?|&\#(0*)80;?|p)(&\#(0*)116;?|&\#(0*)84;?|t)(&\#(0*)58;?|\:))#i",
            "#(o)(nmouseover\s?=)#i",
            "#(o)(nmouseout\s?=)#i",
            "#(o)(nmousedown\s?=)#i",
            "#(o)(nmousemove\s?=)#i",
            "#(o)(nmouseup\s?=)#i",
            "#(o)(nclick\s?=)#i",
            "#(o)(ndblclick\s?=)#i",
            "#(o)(nload\s?=)#i",
            "#(o)(nsubmit\s?=)#i",
            "#(o)(nblur\s?=)#i",
            "#(o)(nchange\s?=)#i",
            "#(o)(nfocus\s?=)#i",
            "#(o)(nselect\s?=)#i",
            "#(o)(nunload\s?=)#i",
            "#(o)(nkeypress\s?=)#i"
        );
        
        $message = preg_replace($js_array, "$1<b></b>$2$4", $message);
    }
}
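
The long first pattern above exists to catch javascript: written with character references. An alternative that may be easier to reason about is to decode the value first and then compare the scheme against a whitelist. The Python sketch below is only an illustration of that idea, not MyBB's approach; is_safe_url and SAFE_SCHEMES are hypothetical names.

```python
import html
import re

# Hypothetical allowlist of URL schemes for href/src values.
SAFE_SCHEMES = {"http", "https", "mailto"}


def is_safe_url(value):
    """Decodes entities, drops whitespace/control characters, checks the scheme."""
    decoded = html.unescape(value)
    compact = re.sub(r"[\x00-\x20\x7f]+", "", decoded)
    match = re.match(r"([a-zA-Z][a-zA-Z0-9+.\-]*):", compact)
    if match is None:
        return True  # no scheme at all, so this is a relative URL
    return match.group(1).lower() in SAFE_SCHEMES


print(is_safe_url("https://example.com/"))
print(is_safe_url("&#106;ava&#x73;cript:alert(1)"))
```

Such a check would be applied to each URL-carrying attribute value after parsing, rather than to the raw HTML string.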

Excoriate answered 8/7, 2014 at 18:9 Comment(0)

I decided to just use html5lib-python. This is what I came up with:

#!/usr/bin/env python
import sys
from xml.dom.minidom import Node
import html5lib
from html5lib import (HTMLParser, sanitizer, serializer, treebuilders,
                     treewalkers)

parser = HTMLParser(tokenizer=sanitizer.HTMLSanitizer,
                    tree=treebuilders.getTreeBuilder("dom"))
serializer = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False)

document = parser.parse(sys.stdin.read(), encoding="utf-8")
# find the <html> node
for child in document.childNodes:
    if child.nodeType == Node.ELEMENT_NODE and child.nodeName == 'html':
        htmlNode = child 
# find the <body> node
for child in htmlNode.childNodes:
    if child.nodeType == Node.ELEMENT_NODE and child.nodeName == 'body':
        bodyNode = child
# serialize all children of the <body> node
for child in bodyNode.childNodes:
    stream = treewalkers.getTreeWalker("dom")(child)
    sys.stdout.write(serializer.render(stream, encoding="utf-8"))

Example input:

<script>alert("hax")</script>
<p onload="alert('this is a dangerous attribute')"><b>hello,</b> world</p>

Example output:

&lt;script&gt;alert("hax")&lt;/script&gt;
<p><b>hello,</b> world</p>

Erato answered 8/7, 2014 at 21:33 Comment(1)

Edit: This only works in Python 2. I have a version that works in Python 3 too, but I'm not going to post it because it's a bit hackish. – Erato 8/7, 2014 at 22:29

I personally use HTML Purifier for this exact purpose:

http://htmlpurifier.org/docs

It works well and allows you to customize down to every tag and attribute. So far I have had no security issues with this plugin.

Cort answered 8/7, 2014 at 22:12 Comment(2)

HTML Purifier doesn't support HTML5 yet. – Erato 8/7, 2014 at 22:13

But it allows you to define your own tags and attributes :) – Cort 8/7, 2014 at 22:13
