PHP "pretty print" HTML (not Tidy)
Asked Answered
S

3

33

I'm using the DOM extension in PHP to build some HTML documents, and I want the output to be formatted nicely (with new lines and indentation) so that it's readable, however, from the many tests I've done:

  1. "formatOutput = true" doesn't work at all with saveHTML(), only saveXML()
  2. Even if I used saveXML(), it still only works on elements created via the DOM, not elements that are included with loadHTML(), even with "preserveWhiteSpace = false"

If anyone knows differently I'd really like to know how they got it to work.

So, I have a DOM document, and I'm using saveHTML() to output the HTML. As it's coming from the DOM I know it is valid, there's no need to "Tidy" or validate it in any way.

I'm simply looking for a way to get nicely formatted output from the output I receive from the DOM extension.

NB. As you may have guessed, I don't want to use the Tidy extension as a) it does a lot more that I need it too (the markup is already valid) and b) it actually makes changes to the HTML content (such as the HTML 5 doctype and some elements).

Follow Up:

OK, with the help of the answer below I've worked out why the DOM extension wasn't working. Although the given example works, it still wasn't working with my code. With the help of this comment I found that if you have any text nodes where isWhitespaceInElementContent() is true no formatting will be applied beyond that point. This happens regardless of whether or not preserveWhiteSpace is false. The solution is to remove all of these nodes (although I'm not sure if this may have adverse effects on the actual content).

Stain answered 20/4, 2009 at 13:21 Comment(0)
S
36

you're right, there seems to be no indentation for HTML (others are also confused). XML works, even with loaded code.

<?php
function tidyHTML($buffer) {
    // load our document into a DOM object
    $dom = new DOMDocument();
    // we want nice output
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($buffer);
    $dom->formatOutput = true;
    return($dom->saveHTML());
}

// start output buffering, using our nice
// callback function to format the output.
ob_start("tidyHTML");

?>
<html>
    <head>
    <title>foo bar</title><meta name="bar" value="foo"><body><h1>bar foo</h1><p>It's like comparing apples to oranges.</p></body></html>
<?php
// this will be called implicitly, but we'll
// call it manually to illustrate the point.
ob_end_flush();
?>

result:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<title>foo bar</title>
<meta name="bar" value="foo">
</head>
<body>
<h1>bar foo</h1>
<p>It's like comparing apples to oranges.</p>
</body>
</html>

the same with saveXML() ...

<?xml version="1.0" standalone="yes"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
  <head>
    <title>foo bar</title>
    <meta name="bar" value="foo"/>
  </head>
  <body>
    <h1>bar foo</h1>
    <p>It's like comparing apples to oranges.</p>
  </body>
</html>

probably forgot to set preserveWhiteSpace=false before loadHTML?

disclaimer: i stole most of the demo code from tyson clugg/php manual comments. lazy me.


UPDATE: i now remember some years ago i tried the same thing and ran into the same problem. i fixed this by applying a dirty workaround (wasn't performance critical): i just somehow converted around between SimpleXML and DOM until the problem vanished. i suppose the conversion got rid of those nodes. maybe load with dom, import with simplexml_import_dom, then output the string, parse this with DOM again and then printed it pretty. as far as i remember this worked (but it was really slow).

Sylviesylvite answered 20/4, 2009 at 14:4 Comment(5)
Thanks. With your examples and the comments on php.net I've worked out the problem (see follow up above).Stain
The solution with DOM seems to me quite heavyweight. How fast or slow it is? It is worth to use it also on smaller snippets, or only on the whole page?Malynda
There is a problem while using saveXML() with some tags with no value such as <textarea type="text" name="name"></textarea> it converts it to <textarea type="text" name="name"/> is there any way I can fix it ?Cataplexy
I hereby release my version of the PHP code in this answer under the terms of the MIT license. Have at it!Wernerwernerite
i've been caught red-handed.Sylviesylvite
E
4

The result:

<!DOCTYPE html>
<html>
    <head>
        <title>My website</title>
    </head>
</html>

Please consider:

function indentContent($content, $tab="\t"){
    $content = preg_replace('/(>)(<)(\/*)/', "$1\n$2$3", $content); // add marker linefeeds to aid the pretty-tokeniser (adds a linefeed between all tag-end boundaries)
    $token = strtok($content, "\n"); // now indent the tags
    $result = ''; // holds formatted version as it is built
    $pad = 0; // initial indent
    $matches = array(); // returns from preg_matches()
    // scan each line and adjust indent based on opening/closing tags
    while ($token !== false && strlen($token)>0){
        $padPrev = $padPrev ?: $pad; // previous padding //Artis
        $token = trim($token);
        // test for the various tag states
        if (preg_match('/.+<\/\w[^>]*>$/', $token, $matches)){// 1. open and closing tags on same line - no change
            $indent=0;
        }elseif(preg_match('/^<\/\w/', $token, $matches)){// 2. closing tag - outdent now
            $pad--;
            if($indent>0) $indent=0;
        }elseif(preg_match('/^<\w[^>]*[^\/]>.*$/', $token, $matches)){// 3. opening tag - don't pad this one, only subsequent tags (only if it isn't a void tag)
            foreach($matches as $m){
                if (preg_match('/^<(area|base|br|col|command|embed|hr|img|input|keygen|link|meta|param|source|track|wbr)/im', $m)){// Void elements according to http://www.htmlandcsswebdesign.com/articles/voidel.php
                    $voidTag=true;
                    break;
                }
            }
            $indent = 1;
        }else{// 4. no indentation needed
            $indent = 0;
        }

        if ($token == "<textarea>") {
            $line = str_pad($token, strlen($token) + $pad, $tab, STR_PAD_LEFT); // pad the line with the required number of leading spaces
            $result .= $line; // add to the cumulative result, with linefeed
            $token = strtok("\n"); // get the next token
            $pad += $indent; // update the pad size for subsequent lines
        } elseif ($token == "</textarea>") {
            $line = $token; // pad the line with the required number of leading spaces
            $result .= $line . "\n"; // add to the cumulative result, with linefeed
            $token = strtok("\n"); // get the next token
            $pad += $indent; // update the pad size for subsequent lines
        } else {
            $line = str_pad($token, strlen($token) + $pad, $tab, STR_PAD_LEFT); // pad the line with the required number of leading spaces
            $result .= $line . "\n"; // add to the cumulative result, with linefeed
            $token = strtok("\n"); // get the next token
            $pad += $indent; // update the pad size for subsequent lines
            if ($voidTag) {
                $voidTag = false;
                $pad--;
            }
        }           

    return $result;
}

//$htmldoc - DOMdocument Object!

$niceHTMLwithTABS = indentContent($htmldoc->saveHTML(), $tab="\t");

echo $niceHTMLwithTABS;

Will result in HTML that has:

  • Indentation based on "levels"
  • Line breaks after block level elements
  • While inline and self-closing elements are not affected

The function (which is a method for class I use) is largely based on: https://mcmap.net/q/453437/-keeping-line-breaks-when-using-php-39-s-domdocument-appendchild

Ethics answered 24/5, 2020 at 18:58 Comment(1)
I think, someone just edited my code into non-functional one. There's an opening for while loop, but no closing anymore.Ethics
T
-2

You can use the code for the hl_tidy function of the htmLawed library.

// indent using one tab per indent, with all HTML being within an imaginary div
$out = hl_tidy($in, 't', 'div')
Toneytong answered 13/8, 2012 at 14:37 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.