Problems Indenting HTML(5) with PHP

Disclaimer: Please bear with the length of this question. It is a recurring real-world problem that I've seen asked hundreds of times without a clear, working solution ever being presented.

I have hundreds of HTML files I want to mass indent using PHP. At first I thought of using Tidy but, as you probably know, it's not compatible by default with HTML5 tags and attributes. After some research and even more tests, I came up with the following implementation that "fakes" HTML5 support:

function Tidy5($string, $options = null, $encoding = 'utf8')
{
    $tags = array();
    $default = array
    (
        'anchor-as-name' => false,
        'break-before-br' => true,
        'char-encoding' => $encoding,
        'decorate-inferred-ul' => false,
        'doctype' => 'omit',
        'drop-empty-paras' => false,
        'drop-font-tags' => true,
        'drop-proprietary-attributes' => false,
        'force-output' => true,
        'hide-comments' => false,
        'indent' => true,
        'indent-attributes' => false,
        'indent-spaces' => 2,
        'input-encoding' => $encoding,
        'join-styles' => false,
        'logical-emphasis' => false,
        'merge-divs' => false,
        'merge-spans' => false,
        'new-blocklevel-tags' => ' article aside audio details dialog figcaption figure footer header hgroup menutidy nav section source summary track video',
        'new-empty-tags' => 'command embed keygen source track wbr',
        'new-inline-tags' => 'btidy canvas command data datalist embed itidy keygen mark meter output progress time wbr',
        'newline' => 0,
        'numeric-entities' => false,
        'output-bom' => false,
        'output-encoding' => $encoding,
        'output-html' => true,
        'preserve-entities' => true,
        'quiet' => true,
        'quote-ampersand' => true,
        'quote-marks' => false,
        'repeated-attributes' => 1,
        'show-body-only' => true,
        'show-warnings' => false,
        'sort-attributes' => 1,
        'tab-size' => 4,
        'tidy-mark' => false,
        'vertical-space' => true,
        'wrap' => 0,
    );

    $doctype = $menu = null;

    if ((strncasecmp($string, '<!DOCTYPE', 9) === 0) || (strncasecmp($string, '<html', 5) === 0))
    {
        $doctype = '<!DOCTYPE html>'; $options['show-body-only'] = false;
    }

    $options = (is_array($options) === true) ? array_merge($default, $options) : $default;

    foreach (array('b', 'i', 'menu') as $tag)
    {
        if (strpos($string, '<' . $tag . ' ') !== false)
        {
            $tags[$tag] = array
            (
                '<' . $tag . ' ' => '<' . $tag . 'tidy ',
                '</' . $tag . '>' => '</' . $tag . 'tidy>',
            );

            $string = str_replace(array_keys($tags[$tag]), $tags[$tag], $string);
        }
    }

    $string = tidy_repair_string($string, $options, $encoding);

    if (empty($string) !== true)
    {
        foreach ($tags as $tag)
        {
            $string = str_replace($tag, array_keys($tag), $string);
        }

        if (isset($doctype) === true)
        {
            $string = $doctype . "\n" . $string;
        }

        return $string;
    }

    return false;
}

It works but has 2 flaws. The first: HTML comments, script and style tags are not correctly indented:

<link href="/_/style/form.css" rel="stylesheet" type="text/css"><!--[if lt IE 9]>
    <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!--<script type="text/javascript" src="//raw.github.com/kevinburke/tecate/master/tecate.js"></script>-->

</script><script charset="UTF-8" src="//cdnjs.cloudflare.com/ajax/libs/bootstrap-datepicker/1.0.0/js/locales/bootstrap-datepicker.pt.js" type="text/javascript">
</script><!--<script src="/3rd/parsley/i18n/messages.pt_br.js"></script>-->
    <!--<script src="//cdnjs.cloudflare.com/ajax/libs/parsley.js/1.1.10/parsley.min.js"></script>-->
    <script src="/3rd/select2/locales/select2_locale_pt-PT.js" type="text/javascript">
</script><script src="/3rd/tcrosen/bootstrap-typeahead.js" type="text/javascript">

And the other flaw, which is way more critical: Tidy converts all menu tags to ul and insists on dropping any empty inline tag, forcing me to hack my way around it. To make that absolutely clear, here are some examples:

  • <br> empty tag
  • <i>text</i> inline tag
  • <i class="icon-home"></i> empty inline tag (example from Font Awesome)

If you inspect the code, you'll notice that I've accounted for b, i and menu tags using a not-perfect str_replace hack - I could have used a more robust regular expression or even str_ireplace to accomplish the same thing, but for my purposes str_replace is faster and good enough. However, that still leaves behind any other empty inline tags that I haven't accounted for, which sucks.
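
One way to cover the remaining empty inline tags without a separate hack per tag might be to swap them for placeholder tokens before formatting and restore them afterwards. The following is only a sketch under assumptions: the function names, the @@EMPTYn@@ token format and the default tag list are made up for illustration, the pattern does not handle attribute values that contain >, and whether Tidy leaves such a text token untouched in every context is not verified here.

```php
<?php
/**
 * Hypothetical helper: replace every empty inline element
 * (e.g. <i class="icon-home"></i>) with an opaque text token.
 * Returns array(modified HTML, token => original markup map).
 */
function protectEmptyElements(string $html, array $tags = array('a', 'b', 'i', 'em', 'span', 'strong')): array
{
    $map = array();
    $names = implode('|', array_map('preg_quote', $tags));

    $html = preg_replace_callback(
        // opening tag from the list, optional attributes, immediately closed again
        '~<(' . $names . ')\b[^>]*></\1>~i',
        function (array $m) use (&$map) {
            $token = '@@EMPTY' . count($map) . '@@';
            $map[$token] = $m[0];

            return $token;
        },
        $html
    );

    return array($html, $map);
}

/** Hypothetical counterpart: put the original elements back after formatting. */
function restoreEmptyElements(string $html, array $map): string
{
    return strtr($html, $map);
}

list($protected, $map) = protectEmptyElements('<p><i class="icon-home"></i> Home <span></span></p>');
// ... run the formatter of your choice on $protected here ...
echo restoreEmptyElements($protected, $map), "\n";
```

Restricting the match to a known list of inline tags keeps empty script, td or textarea elements out of the swap, since a bare text token in their place could be moved or escaped by the formatter.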

So I turned to DOMDocument, but I soon discovered that in order for formatOutput to work I have to:

  1. strip all whitespace between tags (using a regex of course: '~>[[:space:]]++<~m' replaced with '><')
  2. convert all newline combinations to \n so it doesn't encode \r as &#13; for instance
  3. load the input string as HTML, output as XML

To my surprise, DOMDocument also has problems with empty inline tags: whenever it sees <i class="icon-home"></i><someOtherTag>text</someOtherTag> or similar, it turns that into <i class="icon-home"><someOtherTag>text</someOtherTag></i>, which completely messes up the browser rendering of the page. To overcome that, I've found that using LIBXML_NOEMPTYTAG along with DOMDocument::saveXML() writes any tag without content (including truly empty tags such as <br />) with an explicit closing tag, so for instance:

  • <i class="icon-home"></i> stays the same (as it should)
  • <br> becomes <br></br> messing up the browser rendering (yet again)

To fix that, I have to use a regular expression that looks for ~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~ and replaces the matched string with a simple />. One other major problem with saveXML() is that it adds <![CDATA[ .. ]]> blocks around my script and style inner HTML, which renders their contents invalid and I have to go back and preg_replace those tokens again. This "works":

function DOM5($html)
{
    $dom = new \DOMDocument();

    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}
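
The void-element fix-up from DOM5() can be tried in isolation on a small string. This is a minimal sketch: the pattern is the one used above, while the function name collapseVoidElements() is made up for illustration. Note that the pattern only looks at the closing tag, so it assumes the closing tag directly follows its own opening tag, which is how saveXML() writes empty elements.

```php
<?php
// Hypothetical helper name; the pattern itself is taken from DOM5().
function collapseVoidElements(string $xml): string
{
    $void = 'area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr';

    // "<br></br>" becomes "<br />"; non-void tags such as <i></i> are left alone.
    return preg_replace('~></(?:' . $void . ')>~', ' />', $xml);
}

echo collapseVoidElements('<p><i class="icon-home"></i><br></br><img src="a.png"></img></p>'), "\n";
// prints: <p><i class="icon-home"></i><br /><img src="a.png" /></p>
```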

Seems like the two most recommended and validated methods of indenting HTML don't produce correct or reliable results for HTML5 in-the-wild, and I have to succumb to the dark god Cthulhu.

I did try other libraries, such as:

  • html5lib - couldn't get DOMDocument::$formatOutput to work
  • tidy-html5 - same problems as normal tidy, except it supports HTML5 tags / attributes

At this point, I'm considering writing something that works only with regexes if no better solution exists. But I thought that perhaps DOMDocument could be forced to work with HTML5 and script / style tags by using a custom XSLT. I've never played around with XSLTs before so I don't know if this is realistic or not, perhaps one of you XML experts could tell me and perhaps provide a starting point.

Naara answered 18/6, 2013 at 15:25 Comment(10)
Cthulhu says use regex to do the comments! - Insurgent
@Precastic: With Tidy? Either way, it's still unpredictable: <!-- can appear in script tags and have a different meaning. The major problem with Tidy is the empty inline tags though, that's why I tried DOMDocument. - Naara
What are the Tidy options you are using? There are options that address a lot of the "problems" you are seeing. As for the empty tags -- empty tags are not, typically, semantically valid. - Wester
@JacobS: You can check the Tidy options that I'm using in the $default array. I copied the new HTML5 tags from the W3C forked tidy-html5 (and I added 3 others to account for the hacks I needed). I also read the entire Tidy manual and couldn't find any option that wouldn't drop empty tags. I know they are not semantically valid, but they are used everywhere and I would prefer following a realistic approach rather than a super-zealous one. - Naara
@JacobS: And I don't think Tidy is doing the right thing (semantically) by dropping empty tags without giving the user the final choice. After all, they have a class and their content could be populated using the content property of CSS selectors or JavaScript innerHTML. That happens a lot on the real web nowadays. - Naara
Unfortunately, that is the choice that Tidy made (drop any tag that does not contain an "id" or "name" attribute). Tidy is specifically intended to produce semantically valid output and you are trying to use it to NOT fix problems. Anyway, you say that you used tidy-html5 -- did you try setting "drop-empty-elements" to no? Seems to work fine for me (although you're not really going to prevent it from trying to fix any malformed HTML problems). Otherwise, I'd suggest finding (or making) a better tool rather than trying to make something do what it isn't intended for. - Wester
@JacobS: I understand that, I just wish I had some say in it; after all, semantically incorrect is not the same as malformed HTML. But what you say is true: I don't need any "fixes", I just want to indent HTML. If Tidy doesn't do it, there must be some library that does. I will try the option you mentioned for tidy-html5 after dinner, thanks for the suggestion. - Naara
It's neither here nor there, but "technically", the HTML spec says that <i>, <p> and similar can only be used to wrap text or text elements, so it is technically malformed -- but I agree that it doesn't necessarily make sense. Anyway, I haven't used it, but maybe you can take a look at js-beautify, which has a Python command-line version. - Wester
@AlixAxel My first comment was an attempt at humour :) Not quite sure why formatting is so important for you, but if it is, then why not write your own parser in PHP? You then have full control over it & it would probably take you less time than trying to configure tools that don't "quite" do what you want. - Insurgent
Possible duplicate of Tidying HTML5 Output Indentation in PHP - Guerrero

You have not mentioned whether your intention is to transform pages for production purposes or for development, e.g. when debugging HTML output.

If it is the latter, and since you have already mentioned writing a regex-based solution, I have written Dindent for that purpose.

You have not included a sample of the input and the expected output. You can test my implementation using the sandbox.

Guerrero answered 22/2, 2014 at 12:41 Comment(0)

To beautify my HTML5 code I wrote a small PHP class. It's not perfect, but it does the job for my purposes reasonably quickly. Maybe it's useful.

<?php
namespace LBR\LbrService;

/**
 * This script has no licensing-model - do what you want to do with it.
 * 
 * This script is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 *  
 * @author 2014 sunixzs <[email protected]>
 *
 * What does this script do?
 * Take unlovely HTML source code, temporarily remove any sections that should not
 * be processed (e.g. textarea, pre and script), then remove all spaces and line breaks
 * and redefine them by referring to some tag lists. After that, indent the newly created
 * lines, again by reference to tag lists. At the end, put the temporarily removed sections
 * back into the newly generated, hopefully beautiful source code.
 *
 */
class BeautifyMyHtml {

    /**
     * HTML-Tags which should not be processed.
     * Only tags with both an opening and a closing tag work: <example some="attributes">some content</example>
     * <img src="some.source" alt="" /> does not work because of the short end.
     * 
     * @var array
     */
    protected $tagsToIgnore = array (
            'script',
            'textarea',
            'pre',
            'style' 
    );

    /**
     * Code-Blocks which should not be processed are temporarily stored in this array.
     * 
     * @var array
     */
    protected $tagsToIgnoreBlocks = array ();

    /**
     * The tag to ignore at currently used runtime.
     * I had to define this in class and not local in method to get the
     * possibility to access this on anonymous function in preg_replace_callback.
     * 
     * @var string
     */
    protected $currentTagToIgnore;

    /**
     * Remove white-space before and after each line of blocks, which should not be processed?
     *
     * @var boolean
     */
    protected $trimTagsToIgnore = false;

    /**
     * Character used for indentation
     * 
     * @var string
     */
    protected $spaceCharacter = "\t";

    /**
     * Remove html-comments?
     *
     * @var boolean
     */
    protected $removeComments = false;

    /**
     * preg_replace()-Pattern which define opening tags to wrap with newlines.
     * <tag> becomes \n<tag>\n
     * 
     * @var array
     */
    protected $openTagsPattern = array (
            "/(<html\b[^>]*>)/i",
            "/(<head\b[^>]*>)/i",
            "/(<body\b[^>]*>)/i",
            "/(<link\b[^>]*>)/i",
            "/(<meta\b[^>]*>)/i",
            "/(<div\b[^>]*>)/i",
            "/(<section\b[^>]*>)/i",
            "/(<nav\b[^>]*>)/i",
            "/(<table\b[^>]*>)/i",
            "/(<thead\b[^>]*>)/i",
            "/(<tbody\b[^>]*>)/i",
            "/(<tr\b[^>]*>)/i",
            "/(<th\b[^>]*>)/i",
            "/(<td\b[^>]*>)/i",
            "/(<ul\b[^>]*>)/i",
            "/(<li\b[^>]*>)/i",
            "/(<figure\b[^>]*>)/i",
            "/(<select\b[^>]*>)/i" 
    );

    /**
     * preg_replace()-Pattern which define tags prepended with a newline.
     * <tag> becomes \n<tag>
     * 
     * @var array
     */
    protected $patternWithLineBefore = array (
            "/(<p\b[^>]*>)/i",
            "/(<h[0-9]\b[^>]*>)/i",
            "/(<option\b[^>]*>)/i" 
    );

    /**
     * preg_replace()-Pattern which define closing tags to wrap with newlines.
     * </tag> becomes \n</tag>\n
     * 
     * @var array
     */
    protected $closeTagsPattern = array (
            "/(<\/html>)/i",
            "/(<\/head>)/i",
            "/(<\/body>)/i",
            "/(<\/link>)/i",
            "/(<\/meta>)/i",
            "/(<\/div>)/i",
            "/(<\/section>)/i",
            "/(<\/nav>)/i",
            "/(<\/table>)/i",
            "/(<\/thead>)/i",
            "/(<\/tbody>)/i",
            "/(<\/tr>)/i",
            "/(<\/th>)/i",
            "/(<\/td>)/i",
            "/(<\/ul>)/i",
            "/(<\/li>)/i",
            "/(<\/figure>)/i",
            "/(<\/select>)/i" 
    );

    /**
     * preg_match()-Pattern with tag-names to increase indentation.
     * 
     * @var string
     */
    protected $indentOpenTagsPattern = "/<(html|head|body|div|section|nav|table|thead|tbody|tr|th|td|ul|figure|li)\b[ ]*[^>]*[>]/i";

    /**
     * preg_match()-Pattern with tag-names to decrease indentation.
     * 
     * @var string
     */
    protected $indentCloseTagsPattern = "/<\/(html|head|body|div|section|nav|table|thead|tbody|tr|th|td|ul|figure|li)>/i";

    /**
     * Constructor
     */
    public function __construct() {
    }

    /**
     * Adds a Tag which should be returned as the way in source.
     * 
     * @param string $tagToIgnore
     * @throws \RuntimeException
     * @return void
     */
    public function addTagToIgnore($tagToIgnore) {
        if (! preg_match( '/^[a-zA-Z]+$/', $tagToIgnore )) {
            // leading backslash: inside this namespace a bare RuntimeException would not resolve to the global class
            throw new \RuntimeException( "Only characters from a to z are allowed as tag.", 1393489077 );
        }

        if (! in_array( $tagToIgnore, $this->tagsToIgnore )) {
            $this->tagsToIgnore[] = $tagToIgnore;
        }
    }

    /**
     * Setter for trimTagsToIgnore.
     *
     * @param boolean $bool
     * @return void
     */
    public function setTrimTagsToIgnore($bool) {
        $this->trimTagsToIgnore = $bool;
    }

    /**
     * Setter for removeComments.
     *  
     * @param boolean $bool
     * @return void
     */
    public function setRemoveComments($bool) {
        $this->removeComments = $bool;
    }

    /**
     * Callback function used by preg_replace_callback() to store the blocks which should be ignored and set a marker to replace them later again with the blocks.
     * 
     * @param array $e
     * @return string
     */
    private function tagsToIgnoreCallback($e) {
        // build key for reference
        $key = '<' . $this->currentTagToIgnore . '>' . sha1( $this->currentTagToIgnore . $e[0] ) . '</' . $this->currentTagToIgnore . '>';

        // trim each line
        if ($this->trimTagsToIgnore) {
            $lines = explode( "\n", $e[0] );
            array_walk( $lines, function (&$n) {
                $n = trim( $n );
            } );
            $e[0] = implode( PHP_EOL, $lines );
        }

        // add block to storage
        $this->tagsToIgnoreBlocks[$key] = $e[0];

        return $key;
    }

    /**
     * The main method.
     * 
     * @param string $buffer The HTML-Code to process
     * @return string The nice looking sourcecode
     */
    public function beautify($buffer) {
        // remove blocks, which should not be processed and add them later again using keys for reference 
        foreach ( $this->tagsToIgnore as $tag ) {
            $this->currentTagToIgnore = $tag;
            $buffer = preg_replace_callback( '/<' . $this->currentTagToIgnore . '\b[^>]*>([\s\S]*?)<\/' . $this->currentTagToIgnore . '>/mi', array (
                    $this,
                    'tagsToIgnoreCallback' 
            ), $buffer );
        }

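        // temporarily remove comments to keep original linebreaks
        // (skipped when comments are to be removed anyway, otherwise the removal further down would never find them)
        if (! $this->removeComments) {
            $this->currentTagToIgnore = 'htmlcomment';
            $buffer = preg_replace_callback( "/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/ms", array (
                    $this,
                    'tagsToIgnoreCallback' 
            ), $buffer );
        }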

        // cleanup source
        // ... all in one line
        // ... remove double spaces
        // ... remove tabulators
        $buffer = preg_replace( array (
                "/\s\s+|\n/",
                "/ +/",
                "/\t+/" 
        ), array (
                "",
                " ",
                "" 
        ), $buffer );

        // remove comments, if requested
        if ($this->removeComments) {
            $buffer = preg_replace( "/<!--(?!\s*(?:\[if [^\]]+]|<!|>))(?:(?!-->).)*-->/ms", "", $buffer );
        }

        // add newlines for several tags
        $buffer = preg_replace( $this->patternWithLineBefore, "\n$1", $buffer ); // tags with line before tag
        $buffer = preg_replace( $this->openTagsPattern, "\n$1\n", $buffer ); // opening tags
        $buffer = preg_replace( $this->closeTagsPattern, "\n$1\n", $buffer ); // closing tags


        // get the html each line and do indention
        $lines = explode( "\n", $buffer );
        $indentionLevel = 0;
        $cleanContent = array (); // storage for indented lines
        foreach ( $lines as $line ) {
            // continue loop on empty lines
            if (! $line) {
                continue;
            }

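            // test for closing tags (never drop below zero on unbalanced markup,
            // since str_repeat() rejects a negative count)
            if (preg_match( $this->indentCloseTagsPattern, $line ) && $indentionLevel > 0) {
                $indentionLevel --;
            }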

            // push content
            $cleanContent[] = str_repeat( $this->spaceCharacter, $indentionLevel ) . $line;

            // test for opening tags
            if (preg_match( $this->indentOpenTagsPattern, $line )) {
                $indentionLevel ++;
            }
        }

        // write indented lines back to buffer
        $buffer = implode( PHP_EOL, $cleanContent );

        // add blocks, which should not be processed
        $buffer = str_replace( array_keys( $this->tagsToIgnoreBlocks ), $this->tagsToIgnoreBlocks, $buffer );

        return $buffer;
    }
}

$BeautifyMyHtml = new \LBR\LbrService\BeautifyMyHtml();
$BeautifyMyHtml->setTrimTagsToIgnore( true );
//$BeautifyMyHtml->setRemoveComments(true);
echo $BeautifyMyHtml->beautify( file_get_contents( 'http://example.org' ) );
?>
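
The indentation pass of beautify() can also be looked at in isolation. The sketch below is a simplified illustration, not the class itself: indentLines(), its parameters and the default tag list are hypothetical, and it assumes the input has already been split into one tag or text run per line. It clamps the level at zero so that unbalanced markup cannot produce a negative repeat count.

```php
<?php
/**
 * Simplified, hypothetical stand-in for the indentation loop:
 * dedent before printing a closing tag, indent after printing an opening tag.
 */
function indentLines(array $lines, array $blockTags = array('div', 'ul', 'li'), string $indent = "\t"): string
{
    $names = implode('|', array_map('preg_quote', $blockTags));
    $level = 0;
    $out = array();

    foreach ($lines as $line) {
        $line = trim($line);
        if ($line === '') {
            continue; // skip empty lines
        }

        // a closing tag belongs one level up
        if (preg_match('~^</(?:' . $names . ')>~i', $line)) {
            $level = max(0, $level - 1);
        }

        $out[] = str_repeat($indent, $level) . $line;

        // everything after an opening tag is nested one level deeper
        if (preg_match('~^<(?:' . $names . ')\b[^>]*>~i', $line)) {
            $level++;
        }
    }

    return implode("\n", $out);
}

echo indentLines(array('<ul>', '<li>', 'one', '</li>', '</ul>'), array('ul', 'li'), '  '), "\n";
```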
Limbic answered 27/2, 2014 at 15:52 Comment(0)
