HTML filter that is HTML5 compliant [duplicate]
T

5

57

Is there a simple approach to add an HTML5 ruleset to HTMLPurifier?

HP can be configured to recognize new tags with:

// setup configurable HP instance
$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'html5 draft');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.DefinitionImpl', null); // no caching
$def = $config->getHTMLDefinition(true);

// add a new tag
$form = $def->addElement(
  'article',   // name
  'Block',     // content set
  'Flow',      // allowed children
  'Common',    // attribute collection
  array(       // attributes
  )
);

// add a new attribute
$def->addAttribute('a', 'contextmenu', "ID");

However, this is clearly a bit of work, since there are a lot of new HTML5 tags and attributes that have to be registered, and the new global attributes should be combinable even with existing HTML 4 tags. (It's difficult to judge from the docs how to augment the core rules.) So, is there a more useful config format/array structure to feed new and updated tag+attribute+context configuration (inline/block/empty/flow/..) into HTMLPurifier?

# mostly confused about how to extend existing tags:
$def->addAttribute('input', 'type', "...|...|...");

# or how to allow data-* attributes (if I actually wanted that):
$def->addAttribute("data-*", ...
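
The kind of array structure asked about here could be pictured as a plain table that is looped into the same addElement()/addAttribute() calls shown above. This is only a sketch: the function name register_definitions() and the two table variables are hypothetical, the entries are illustrative rather than a vetted HTML5 ruleset, and it does not address data-* wildcards.

```php
<?php
// Hypothetical table: element name => [content set, allowed children,
// attribute collection, element-specific attributes].
$html5_elements = array(
  'article' => array('Block',  'Flow',   'Common', array()),
  'section' => array('Block',  'Flow',   'Common', array()),
  'mark'    => array('Inline', 'Inline', 'Common', array()),
);

// Hypothetical table: [existing element, new attribute, value type].
$html5_attributes = array(
  array('a', 'contextmenu', 'ID'),
);

// Feeds both tables into a definition object that offers
// addElement() and addAttribute(), such as the $def obtained above.
function register_definitions($def, array $elements, array $attributes) {
  foreach ($elements as $name => $spec) {
    list($type, $contents, $collection, $attr) = $spec;
    $def->addElement($name, $type, $contents, $collection, $attr);
  }
  foreach ($attributes as $row) {
    list($element, $attribute, $valueType) = $row;
    $def->addAttribute($element, $attribute, $valueType);
  }
}
```

With the $def from the first snippet, the call would be register_definitions($def, $html5_elements, $html5_attributes). Disabling a tag then means removing one row from the table, although each row still has to be vetted by hand.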

And of course not all new HTML5 tags are fit for unrestricted allowance. HTMLPurifier is all about content filtering, and defining value constraints is where it's at. <canvas>, for example, might not be that big of a deal when it appears in user content, because it's useless at best without JavaScript (which HP already filters out). But other tags and attributes might be undesirable, so a flexible configuration structure is imperative for enabling/disabling tags and their associated attributes.

(Guess I should update some research...) But there's still no practical compendium/specification (no, XML DTDs aren't) that suits an HP configuration.

(Uh, and HTML5 is no longer a draft.)

Tarpan answered 14/4, 2011 at 18:29 Comment(5)
Have you asked the man who can? #4566801 - Greenock
@thirtydot: That was probably before he added the PH5P parser thingy. Which is anyway not relevant, since you can just add new tags and the HTML4 parsing technique should work well enough on HTML5. - Tarpan
You should be aware that HTML 5 is still a draft (even though it's been at "last call" for over 2 years)...and thus can change...thus supported HTML 5 of today is not necessarily that of tomorrow. - Sanity
The main problem is actually, as mario points out, combing through all of the attributes and approving them manually. Which is a good thing, because I vetted every single attribute in HTML Purifier's current attribute set very carefully. It is some legwork, but it should not be too difficult for a sufficiently motivated individual. Alas, I am not presently that individual. - Od
(This is one of the reasons I don't like the bounty system: for a sufficiently hard problem, someone is probably going to get the bounty for the wrong answer :-) - Od
H
12

The PHP tidy extension can be configured to recognize HTML5 tags: http://tidy.sourceforge.net/docs/quickref.html#new-blocklevel-tags
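
As a rough sketch of what that configuration might look like: the option names new-blocklevel-tags, new-inline-tags and show-body-only come from the tidy quick reference, while the tag lists and the sample markup are only illustrative. This merely teaches tidy about the tags so it repairs them instead of rejecting them; it does not sanitize anything.

```php
<?php
// Tidy options; the tag lists are examples, not a complete HTML5 set.
$tidy_config = array(
  'new-blocklevel-tags' => 'article section nav aside header footer',
  'new-inline-tags'     => 'mark time',
  'show-body-only'      => true,   // emit only the body content
);

// Only runs when the tidy extension is actually installed.
if (extension_loaded('tidy')) {
  $dirty = '<article><p>Hello <mark>world</mark></article>';
  $tidy = tidy_parse_string($dirty, $tidy_config, 'utf8');
  $tidy->cleanRepair();
  echo tidy_get_output($tidy);
}
```

Newer tidy builds may already know these elements, in which case declaring them again should be unnecessary.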

Hipolitohipp answered 17/4, 2011 at 1:57 Comment(3)
You could possibly wrap the tidy cleanAndRepair in your own class and mix it with strip_tags and htmlspecialchars. Tidy also has a bare option that strips proprietary attributes, maybe that will help. - Hipolitohipp
tidy_clean_repair() is probably best suited for reformatting your own code only. The drop-proprietary-attributes flag amusingly removes any new HTML5 attributes but keeps onClick= etc. So it's really no HTMLPurifier alternative. - Tarpan
You can, however, use new-blocklevel-tags and such to define them. - Ironhanded
C
7

There's this configuration for HTML Purifier that allows newer HTML5 tags.

Source: https://github.com/kennberg/php-htmlpurifier-html5

<?php
/**
 * Load HTMLPurifier with HTML5, TinyMCE, YouTube, Video support.
 *
 * Copyright 2014 Alex Kennberg (https://github.com/kennberg/php-htmlpurifier-html5)
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

require_once(LIB_DIR . 'third-party/htmlpurifier/HTMLPurifier.safe-includes.php');


function load_htmlpurifier($allowed) {
  $config = HTMLPurifier_Config::createDefault();
  $config->set('HTML.Doctype', 'HTML 4.01 Transitional');
  $config->set('CSS.AllowTricky', true);
  $config->set('Cache.SerializerPath', '/tmp');

  // Allow iframes from:
  // o YouTube.com
  // o Vimeo.com
  $config->set('HTML.SafeIframe', true);
  $config->set('URI.SafeIframeRegexp', '%^(http:|https:)?//(www\.youtube(?:-nocookie)?\.com/embed/|player\.vimeo\.com/video/)%');

  $config->set('HTML.Allowed', implode(',', $allowed));

  // Set some HTML5 properties
  $config->set('HTML.DefinitionID', 'html5-definitions'); // unique id
  $config->set('HTML.DefinitionRev', 1);

  if ($def = $config->maybeGetRawHTMLDefinition()) {
    // http://developers.whatwg.org/sections.html
    $def->addElement('section', 'Block', 'Flow', 'Common');
    $def->addElement('nav',     'Block', 'Flow', 'Common');
    $def->addElement('article', 'Block', 'Flow', 'Common');
    $def->addElement('aside',   'Block', 'Flow', 'Common');
    $def->addElement('header',  'Block', 'Flow', 'Common');
    $def->addElement('footer',  'Block', 'Flow', 'Common');

    // Content model actually excludes several tags, not modelled here
    $def->addElement('address', 'Block', 'Flow', 'Common');
    $def->addElement('hgroup', 'Block', 'Required: h1 | h2 | h3 | h4 | h5 | h6', 'Common');

    // http://developers.whatwg.org/grouping-content.html
    $def->addElement('figure', 'Block', 'Optional: (figcaption, Flow) | (Flow, figcaption) | Flow', 'Common');
    $def->addElement('figcaption', 'Inline', 'Flow', 'Common');

    // http://developers.whatwg.org/the-video-element.html#the-video-element
    $def->addElement('video', 'Block', 'Optional: (source, Flow) | (Flow, source) | Flow', 'Common', array(
      'src' => 'URI',
      'type' => 'Text',
      'width' => 'Length',
      'height' => 'Length',
      'poster' => 'URI',
      'preload' => 'Enum#auto,metadata,none',
      'controls' => 'Bool',
    ));
    $def->addElement('source', 'Block', 'Flow', 'Common', array(
      'src' => 'URI',
      'type' => 'Text',
    ));

    // http://developers.whatwg.org/text-level-semantics.html
    $def->addElement('s',    'Inline', 'Inline', 'Common');
    $def->addElement('var',  'Inline', 'Inline', 'Common');
    $def->addElement('sub',  'Inline', 'Inline', 'Common');
    $def->addElement('sup',  'Inline', 'Inline', 'Common');
    $def->addElement('mark', 'Inline', 'Inline', 'Common');
    $def->addElement('wbr',  'Inline', 'Empty', 'Core');

    // http://developers.whatwg.org/edits.html
    $def->addElement('ins', 'Block', 'Flow', 'Common', array('cite' => 'URI', 'datetime' => 'CDATA'));
    $def->addElement('del', 'Block', 'Flow', 'Common', array('cite' => 'URI', 'datetime' => 'CDATA'));

    // TinyMCE
    $def->addAttribute('img', 'data-mce-src', 'Text');
    $def->addAttribute('img', 'data-mce-json', 'Text');

    // Others
    $def->addAttribute('iframe', 'allowfullscreen', 'Bool');
    $def->addAttribute('table', 'height', 'Text');
    $def->addAttribute('td', 'border', 'Text');
    $def->addAttribute('th', 'border', 'Text');
    $def->addAttribute('tr', 'width', 'Text');
    $def->addAttribute('tr', 'height', 'Text');
    $def->addAttribute('tr', 'border', 'Text');
  }

  return new HTMLPurifier($config);
}
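
A possible way to call the function above, as a sketch only: the whitelist entries are made up for illustration, and the "tag[attr1|attr2]" notation is the format HTML.Allowed accepts once the entries are joined with commas.

```php
<?php
// Hypothetical whitelist: each entry is "tag" or "tag[attr1|attr2]".
$allowed = array(
  'p', 'a[href]', 'section', 'article',
  'figure', 'figcaption',
  'video[src|poster|controls]', 'source[src|type]',
);

// This is the string load_htmlpurifier() hands to HTML.Allowed.
$allowed_string = implode(',', $allowed);

// Only runs when load_htmlpurifier() (and HTML Purifier) is loaded.
if (function_exists('load_htmlpurifier')) {
  $purifier = load_htmlpurifier($allowed);
  echo $purifier->purify('<section><p onclick="x()">Hello</p></section>');
}
```

Tags registered in the definition but missing from this list are still stripped, so the list is where individual HTML5 elements get switched on or off.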
Cesaro answered 5/5, 2015 at 22:49 Comment(3)
Nice find. You might want to excerpt the relevant sections however, with basic attribution. Links are regularly becoming inaccessible (in particular for github; which there's rarely an internet archive cache of). -- It registers the basic tags at least; however lacks many new attributes, just lists a few selected data-* props for example. - Tarpan
Noted; the full listing is now included in the post. - Cesaro
Cool script. HTML Purifier still seems to remove (alternative) text content in the <video> tag, although the configuration tells it not to ("Flow"). Any ideas why? - Torchwood
L
4

I'm using a fix for WordPress, but maybe this can help you too (at least for the array part):

http://nicolasgallagher.com/using-html5-elements-in-wordpress-post-content/

http://hybridgarden.com/blog/misc/adding-html5-capability-to-wordpress/

also:

http://code.google.com/p/html5lib/ Python and PHP implementations of an HTML parser based on the WHATWG HTML5 specification, for maximum compatibility with major desktop web browsers.

Lindyline answered 1/8, 2011 at 18:57 Comment(1)
Not sure about the WP snippets, but html5lib sounds quite promising. It probably doesn't have the HP feature set, but at the very least might make an excellent transitioning solution. - Tarpan
P
3

I know this topic is really old, but since it's still relevant, I decided to respond, especially as the landscape has changed since the question was originally asked.

You can use https://github.com/xemlock/htmlpurifier-html5, which extends HTML Purifier with spec-compliant definitions of HTML5 elements and attributes.

The usage is almost the same as with the original HTML Purifier; you just need to replace HTMLPurifier_Config with HTMLPurifier_HTML5Config:

$config = HTMLPurifier_HTML5Config::createDefault();
$purifier = new HTMLPurifier($config);

$clean_html5 = $purifier->purify($dirty_html5);

Disclaimer: I'm the author of the extension.

Punt answered 9/8, 2019 at 12:31 Comment(0)
S
0

Gallery Role has an experimental HTML5 parser that is based on HTMLPurifier:

https://github.com/gallery/gallery3-vendor/blob/master/htmlpurifier/modified/HTMLPurifier/Lexer/PH5P.php

Scopula answered 1/8, 2011 at 20:49 Comment(2)
PH5P is part of HTMLPurifier. As far as I understood it however just implements the parsing rules according to the HTML5 SGML-esque serialization. It does not augment the HP output filter I believe (or just didn't get it to work). @EZY probably didn't have time yet to integrate or finish that. - Tarpan
You can use PH5P with HTML Purifier if you want. But that doesn't add the attribute sets you need to HTML Purifier. - Od
