using htmlpurifier for input or output escaping/filtering

Asked 24/5, 2010 at 13:54 Answered 24/5, 2010 at 14:11

I am processing user input from the public with a JavaScript WYSIWYG editor and I'm planning on using htmlpurifier to cleanse the text.

I thought it would be enough to use htmlpurifier on the input, store the cleaned input in the database, and then output it without further escaping/filtering. But I've heard other opinions that you should always escape the output.

Can someone explain why I should need to clean the output if I'm already cleaning the input?

Schofield answered 24/5, 2010 at 13:54 Comment(0)

I assume your WYSIWYG editor generates HTML, which is then validated and put in the database. In that case, the validation already took place, so there is no need to validate twice.

As to "escaping output", that's a different matter. You cannot escape the resulting HTML, otherwise you won't have formatted text, and the tags will be visible. Escaping the output is used when you do not want said output to interfere with the markup of the page.
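To make the difference concrete, here is a minimal sketch using PHP's built-in htmlspecialchars(). The $storedHtml value is a made-up example of already-validated HTML, not something from your application.

```php
<?php
// Hypothetical, already-validated HTML as it might come back from the database.
$storedHtml = '<p>Hello <b>world</b></p>';

// Emitted as-is, the browser renders the formatting.
$asHtml = $storedHtml;

// Escaped, the markup becomes literal text and the tags are visible to the reader.
$asText = htmlspecialchars($storedHtml, ENT_QUOTES, 'UTF-8');

echo $asHtml, "\n"; // <p>Hello <b>world</b></p>
echo $asText, "\n"; // &lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;
```

The second form is what you want for plain-text fields (names, titles), not for the rich text coming out of the editor.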

I'd add you have to be very careful with what you allow in your validation phase. You will probably only want to allow a few HTML tags and attributes.
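A restrictive whitelist could look roughly like the sketch below. It assumes HTML Purifier was installed with Composer (the vendor/autoload.php path is an assumption), and the tag list and the $dirty sample are only illustrative, so adjust them to what your editor actually produces.

```php
<?php
// Assumption: HTML Purifier was installed with Composer (ezyang/htmlpurifier).
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
// Allow only a small, illustrative set of tags and attributes.
$config->set('HTML.Allowed', 'p,br,b,i,ul,ol,li,a[href]');

$purifier = new HTMLPurifier($config);

// Hypothetical submission from the WYSIWYG editor.
$dirty = '<p onclick="evil()">Hi <b>there</b><script>alert(1)</script></p>';

// Validate once, after submission and before inserting into the database.
$clean = $purifier->purify($dirty);
```

Anything not named in HTML.Allowed is dropped or stripped by the purifier, so starting from a short list and extending it is safer than the reverse.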

Christabelle answered 24/5, 2010 at 14:0 Comment(5)

The problem with relying on the js editor is that a malicious user could submit a post bypassing whatever checks the js has. – Schofield 24/5, 2010 at 14:4

@Col - yes - but Artefacto was saying that the js "validated" the html - so there is no need to validate twice (meaning to use htmlpurifier) – Schofield 24/5, 2010 at 14:10

No, he's saying you don't need to validate again after reading the data back out of the database. – Gerdy 24/5, 2010 at 14:13

@Schofield I'm not. You have to validate it after submission (with HTML Purifier, if you wish) and before inserting it in the database, just not every time you fetch it from the database. – Christabelle 24/5, 2010 at 14:14

@artefacto - thanks for the clarification and answer - I didn't read it properly – Schofield 24/5, 2010 at 14:17

To be as safe as possible, use HTMLPurifier twice: before saving the HTML to the DB and before outputting it to the screen.
The big drawback of this approach is performance. HTMLPurifier is very slow at filtering HTML, and your pages may take noticeably longer to process.

You should be OK if you perform only 1-2 filterings before outputting something to the screen, but if you do 10 filterings per request like we did, the cost adds up. We decided not to use HTMLPurifier when outputting large amounts of text.

HTMLPurifier took 60% of the processing time per request, and we wanted low response times and a better user experience instead.

It depends on your situation. If you can afford to use HTMLPurifier before outputting, go for it. It's better, and you always have control over which tags you want to allow (for new and even for old content stored in your DB).

Hooey answered 24/5, 2010 at 14:11 Comment(4)

Thanks for your post - but can you explain a case in which I would need to do it twice? E.g. if I do: $id = (int)$_POST['id']; $db->query("select * from users where id = ".intval($id)); have I gained anything in security? – Schofield 24/5, 2010 at 14:15

The second filtering (before output) is helpful in cases where someone has hacked into your DB server but did not manage to break into your web server. The attacker can easily change any content in your DB, and if you're not filtering HTML before outputting, you have a pretty serious security problem. However, I believe this is a very rare scenario. – Hooey 24/5, 2010 at 14:25

This is a ridiculous scenario, I'd say – Hatti 24/5, 2010 at 14:33

I agree, it's rare, but a side effect of filtering before output is that if you decide you no longer want to allow a certain tag (e.g. <img>), it's pretty simple to apply that to all your content. If there was no filtering before output, you would have to go through each entry and remove the tags. – Hooey 24/5, 2010 at 15:46

The mantra "always escape your output", which is a Text to HTML conversion, is a good and reasonable default to fall back on when working in the web space. In the case of HTML Purifier, you are deliberately departing from this advice, because you are performing an HTML to HTML conversion, and treating the resulting HTML as Text again doesn't really make sense.

Responsive answered 24/5, 2010 at 13:58 Comment(2)

thanks for the answer but I didn't quite follow - are you saying that once using htmlpurifier, it could be treated as safe? – Schofield 24/5, 2010 at 14:2

I think it depends on context. If you allow users to write a blog post, you use HTMLPurifier to decide which tags they are allowed to use. Once done, you need that HTML output as HTML. You don't want to treat it as text which escaping does, otherwise if the user bolded a word, it would be escaped and show as <b>word</b>. Perhaps Edward will come back to confirm my comment. – Prosit 21/11, 2010 at 0:39
