Disclaimer: Please bear with the length of this question. This is a recurring question for a real-world problem that I've seen asked hundreds of times with no clear, working solution ever being presented.
I have hundreds of HTML files that I want to mass-indent using PHP. At first I thought of using Tidy but, as you may know, it isn't compatible with HTML5 tags and attributes by default. After some research and even more tests, I came up with the following implementation that "fakes" HTML5 support:
function Tidy5($string, $options = null, $encoding = 'utf8')
{
    $tags = array();
    $default = array
    (
        'anchor-as-name' => false,
        'break-before-br' => true,
        'char-encoding' => $encoding,
        'decorate-inferred-ul' => false,
        'doctype' => 'omit',
        'drop-empty-paras' => false,
        'drop-font-tags' => true,
        'drop-proprietary-attributes' => false,
        'force-output' => true,
        'hide-comments' => false,
        'indent' => true,
        'indent-attributes' => false,
        'indent-spaces' => 2,
        'input-encoding' => $encoding,
        'join-styles' => false,
        'logical-emphasis' => false,
        'merge-divs' => false,
        'merge-spans' => false,
        'new-blocklevel-tags' => ' article aside audio details dialog figcaption figure footer header hgroup menutidy nav section source summary track video',
        'new-empty-tags' => 'command embed keygen source track wbr',
        'new-inline-tags' => 'btidy canvas command data datalist embed itidy keygen mark meter output progress time wbr',
        'newline' => 0,
        'numeric-entities' => false,
        'output-bom' => false,
        'output-encoding' => $encoding,
        'output-html' => true,
        'preserve-entities' => true,
        'quiet' => true,
        'quote-ampersand' => true,
        'quote-marks' => false,
        'repeated-attributes' => 1,
        'show-body-only' => true,
        'show-warnings' => false,
        'sort-attributes' => 1,
        'tab-size' => 4,
        'tidy-mark' => false,
        'vertical-space' => true,
        'wrap' => 0,
    );
    $doctype = $menu = null;
    // Full documents (starting with a doctype or <html>) keep everything outside <body>
    // and get an HTML5 doctype prepended to the result.
    if ((strncasecmp($string, '<!DOCTYPE', 9) === 0) || (strncasecmp($string, '<html', 5) === 0))
    {
        $doctype = '<!DOCTYPE html>';
        $options['show-body-only'] = false;
    }
    $options = (is_array($options) === true) ? array_merge($default, $options) : $default;
    // Rename b, i and menu opening tags that carry attributes (and every matching closing tag)
    // to btidy / itidy / menutidy, which are declared above as custom tags.
    foreach (array('b', 'i', 'menu') as $tag)
    {
        if (strpos($string, '<' . $tag . ' ') !== false)
        {
            $tags[$tag] = array
            (
                '<' . $tag . ' ' => '<' . $tag . 'tidy ',
                '</' . $tag . '>' => '</' . $tag . 'tidy>',
            );
            $string = str_replace(array_keys($tags[$tag]), $tags[$tag], $string);
        }
    }
    $string = tidy_repair_string($string, $options, $encoding);
    if (empty($string) !== true)
    {
        // Undo the placeholder renaming.
        foreach ($tags as $tag)
        {
            $string = str_replace($tag, array_keys($tag), $string);
        }
        if (isset($doctype) === true)
        {
            $string = $doctype . "\n" . $string;
        }
        return $string;
    }
    return false;
}
It works but has two flaws. First, HTML comments and script and style tags are not correctly indented:
<link href="/_/style/form.css" rel="stylesheet" type="text/css"><!--[if lt IE 9]>
<script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
<![endif]-->
<!--<script type="text/javascript" src="//raw.github.com/kevinburke/tecate/master/tecate.js"></script>-->
</script><script charset="UTF-8" src="//cdnjs.cloudflare.com/ajax/libs/bootstrap-datepicker/1.0.0/js/locales/bootstrap-datepicker.pt.js" type="text/javascript">
</script><!--<script src="/3rd/parsley/i18n/messages.pt_br.js"></script>-->
<!--<script src="//cdnjs.cloudflare.com/ajax/libs/parsley.js/1.1.10/parsley.min.js"></script>-->
<script src="/3rd/select2/locales/select2_locale_pt-PT.js" type="text/javascript">
</script><script src="/3rd/tcrosen/bootstrap-typeahead.js" type="text/javascript">
And the other flaw, which is way more critical: Tidy converts all menu tags to ul and insists on dropping any empty inline tag, forcing me to hack my way around it. To make that absolutely clear, here are some examples:
- <br>: empty tag
- <i>text</i>: inline tag
- <i class="icon-home"></i>: empty inline tag (example from Font Awesome)
If you inspect the code, you'll notice that I've accounted for b, i and menu tags using a not-perfect str_replace hack. I could have used a more robust regular expression or even str_ireplace to accomplish the same thing, but for my purposes str_replace is faster and good enough. However, that still leaves behind any other empty inline tags that I haven't accounted for, which sucks.
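For comparison, here is a minimal sketch of what the "more robust regular expression" variant might look like. The function names renameTags and restoreTags are hypothetical and not part of Tidy5. Unlike the str_replace hack, this version renames the listed tags whether or not they carry attributes and regardless of case. It is still naive: it assumes the tag names never appear inside comments, script bodies or attribute values.

```php
<?php
// Hypothetical helper: appends a suffix to the given tag names so that Tidy
// sees them as custom tags (they must still be declared in new-*-tags).
function renameTags(string $html, array $tags, string $suffix = 'tidy'): string
{
    // Build an alternation such as "b|i|menu", escaping each name.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $tags));

    // Match "<b", "</b", "<I class=..." and so on, but not "<br" or "<img":
    // the lookahead requires whitespace, "/" or ">" right after the name.
    $pattern = '~(</?)(' . $names . ')(?=[\s/>])~i';

    return preg_replace($pattern, '${1}${2}' . $suffix, $html);
}

// Hypothetical inverse: strips the suffix again once Tidy has run.
function restoreTags(string $html, array $tags, string $suffix = 'tidy'): string
{
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '~');
    }, $tags));

    $pattern = '~(</?)(' . $names . ')' . preg_quote($suffix, '~') . '(?=[\s/>])~i';

    return preg_replace($pattern, '${1}${2}', $html);
}
```

The idea would be to call renameTags() before tidy_repair_string() and restoreTags() on its result, keeping btidy, itidy and menutidy declared in the options as above.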
So I turned to DOMDocument, but I soon discovered that in order for formatOutput to work I have to:
- strip all whitespace between tags (using a regex, of course: '~>[[:space:]]++<~m' replaced with '><')
- convert all newline combinations to \n so it doesn't encode \r as &#13; for instance
- load the input string as HTML, output as XML
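The first two steps can be sketched in isolation as follows. The function name normalizeForDom is hypothetical; the two patterns are the ones that appear in this question. Note that the second pattern removes whitespace-only runs between tags everywhere, including inside pre elements.

```php
<?php
// Hypothetical helper: prepares markup so that formatOutput can re-indent it.
// Returns null if the input is not valid UTF-8 (preg_replace fails with /u).
function normalizeForDom(string $html): ?string
{
    // 1. Convert every newline sequence (\r\n, \r, ...) to a plain \n.
    $html = preg_replace('~\R~u', "\n", $html);

    if ($html === null) {
        return null;
    }

    // 2. Drop whitespace-only runs between a ">" and the next "<".
    return preg_replace('~>[[:space:]]++<~m', '><', $html);
}

echo normalizeForDom("<ul>\r\n  <li>a</li>\r\n</ul>"); // prints <ul><li>a</li></ul>
```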
To my surprise, DOMDocument also has problems with empty inline tags. Basically, whenever it sees <i class="icon-home"></i><someOtherTag>text</someOtherTag> or similar, it will turn that into <i class="icon-home"><someOtherTag>text</someOtherTag></i>, which will completely mess up the browser rendering of the page. To overcome that, I've found that using LIBXML_NOEMPTYTAG along with DOMDocument::saveXML() will turn any tag without content (including truly empty tags such as <br />) into an explicit opening and closing tag pair, so for instance:
- <i class="icon-home"></i> stays the same (as it should)
- <br> becomes <br></br>, messing up the browser rendering (yet again)
To fix that, I have to use a regular expression that looks for ~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~ and replaces the matched string with a simple />. One other major problem with saveXML() is that it adds <![CDATA[ .. ]]> blocks around my script and style inner HTML, which renders their contents invalid, so I have to go back and preg_replace those tokens again. This "works":
function DOM5($html)
{
    $dom = new \DOMDocument();
    // Enable internal libxml error handling; if it was already on, clear stale errors.
    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }
    // Encode non-ASCII characters as entities so loadHTML() does not misread UTF-8 input.
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    // Normalize newlines to \n and strip whitespace between tags.
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);
    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;
        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            // Remove the CDATA markers and collapse <br></br>-style void elements to <br />.
            $regex = array
            (
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );
            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }
    return false;
}
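To show what the void-element clean-up does on its own, here is a small sketch that applies only that pattern to a string shaped like LIBXML_NOEMPTYTAG output. The function name collapseVoidElements is hypothetical, and the input is hand-written rather than real saveXML() output.

```php
<?php
// Hypothetical helper: rewrites "<br></br>"-style pairs back to "<br />"
// for void elements only, leaving empty inline tags such as <i></i> alone.
function collapseVoidElements(string $xml): string
{
    $void = 'area|base(?:font)?|br|col|command|embed|frame|hr|img|input|'
          . 'keygen|link|meta|param|source|track|wbr';

    // The match starts at the ">" that ends the opening tag, so the
    // replacement " />" turns "<br></br>" into "<br />".
    return preg_replace('~></(?:' . $void . ')>~', ' />', $xml);
}

echo collapseVoidElements('<p>a<br></br>b</p><i class="icon-home"></i>');
// prints <p>a<br />b</p><i class="icon-home"></i>
```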
Seems like the two most recommended and validated methods of indenting HTML don't produce correct or reliable results for HTML5 in-the-wild, and I have to succumb to the dark god Cthulhu.
I did try other libraries, such as:
- html5lib - couldn't get DOMDocument::$formatOutput to work
- tidy-html5 - same problems as normal tidy, except it supports HTML5 tags / attributes
At this point, I'm considering writing something that works only with regexes if no better solution exists. But I thought that perhaps DOMDocument could be forced to work with HTML5 and script / style tags by using a custom XSLT. I've never played around with XSLT before, so I don't know if this is realistic or not; perhaps one of you XML experts could tell me and provide a starting point.
<!-- can appear in script tags and have a different meaning. The major problem with Tidy is the empty inline tags though, that's why I tried DOMDocument. – Naara
[...] $default array. I copied the new HTML5 tags from the W3C forked tidy-html5 (and I added 3 others to account for the hacks I needed). I also read the entire Tidy manual and couldn't find any option that wouldn't drop empty tags. I know they are not semantically valid, but they are used everywhere and I would prefer following a realistic approach rather than a super-zealous one. – Naara
[...] content property of CSS selectors or JavaScript innerHTML. That happens a lot in the real web nowadays. – Naara