Cleaning HTML by removing extra/redundant formatting tags

Asked 20/4, 2012 at 14:26 Answered 25/4, 2012 at 17:27

I have been using CKEditor wysiwyg editor for a website where users are allowed to use the HTML editor to add some comments. I ended up having some extremely redundant nested HTML code in my database that is slowing down the viewing/editing of these comments.

I have comments that look like this (this is a very small example. I have comments with over 100 nested tags):

<p>
 <strong>
  <span style="font-size: 14px">
   <span style="color: #006400">
     <span style="font-size: 14px">
      <span style="font-size: 16px">
       <span style="color: #006400">
        <span style="font-size: 14px">
         <span style="font-size: 16px">
          <span style="color: #006400">This is a </span>
         </span>
        </span>
       </span>
      </span>
     </span>
    </span>
    <span style="color: #006400">
     <span style="font-size: 16px">
      <span style="color: #b22222">Test</span>
     </span>
    </span>
   </span>
  </span>
 </strong>
</p>

My questions are:

1. Is there any library/code/software that can do a smart (i.e. format-aware) clean-up of the HTML code, removing all redundant tags that have no effect on the formatting (because they're overridden by inner tags)? I've tried many existing online solutions (such as HTML Tidy). None of them do what I want.
2. If not, I'll need to write some code for HTML parsing and cleaning. I am planning to use PHP Simple HTML DOM to traverse the HTML tree and find all tags that have no effect. Do you suggest any other HTML parser that is more suitable for my purpose?

Thanks

Update:

I have written some code to analyze the HTML code that I have. All the HTML tags that I have are:

<span> with styles for font-size and/or color
<font> with attributes color and/or size
<a> for links (with href)

<p> (single tag to wrap the whole comment)

I can easily write some code to convert the HTML code into bbcode (e.g. [b], [color=blue], [size=3], etc.). So the above HTML will become something like:

[b][size=14][color=#006400][size=14][size=16][color=#006400]
[size=14][size=16][color=#006400]This is a [/color][/size]
[/size][/color][/size][/size][color=#006400][size=16]
[color=#b22222]Test[/color][/size][/color][/color][/size][/b]

The question now is: Is there an easy way (algorithm/library/etc) to clean-up the messy (as messy as that original HTML) bbcode that will be generated?

thanks again

The answered 20/4, 2012 at 14:26 Comment(4)

This is going to be a tough one to solve. +1 – Frenchy 20/4, 2012 at 14:34

My suggestion, next time use markdown instead of WYSIWYG. – Proustite 23/4, 2012 at 5:33

Didn't see the update stating that <a href="..."> was possible. Can sample code with <a>,, and tags be supplied so we can tweak our solutions. – Seward 26/4, 2012 at 18:26

Can the text content be mixed with html? Meaning is this possible: This is a <a href="#">test</a>? Or will the last element content always only be text? If the latter, then this is an update of the below: jsfiddle.net/mmeah/fUpe8/3 – Seward 28/4, 2012 at 4:46

Introduction

The best solution I have seen so far is HTML Tidy: http://tidy.sourceforge.net/

Beyond converting the format of a document, Tidy is also able to convert deprecated HTML tags into their cascading style sheet (CSS) counterparts automatically through the use of the clean option. The generated output contains an inline style declaration.

It also ensures that the HTML document is XHTML compatible.

Example

$code ='<p>
 <strong>
  <span style="font-size: 14px">
   <span style="color: #006400">
     <span style="font-size: 14px">
      <span style="font-size: 16px">
       <span style="color: #006400">
        <span style="font-size: 14px">
         <span style="font-size: 16px">
          <span style="color: #006400">This is a </span>
         </span>
        </span>
       </span>
      </span>
     </span>
    </span>
    <span style="color: #006400">
     <span style="font-size: 16px">
      <span style="color: #b22222">Test</span>
     </span>
    </span>
   </span>
  </span>
 </strong>
</p>';

If you run

$clean = cleaning($code);
print($clean['body']);

Output

<p>
    <strong>
        <span class="c3">
            <span class="c1">This is a</span> 
                <span class="c2">Test</span>
            </span>
        </strong>
</p>

You can get the CSS

$clean = cleaning($code);
print($clean['style']);

Output

<style type="text/css">
    span.c3 {
        font-size: 14px
    }

    span.c2 {
        color: #006400;
        font-size: 16px
    }

    span.c1 {
        color: #006400;
        font-size: 14px
    }
</style>

Or the full HTML

$clean = cleaning($code);
print($clean['full']);

Output

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title></title>
    <style type="text/css">
/*<![CDATA[*/
    span.c3 {font-size: 14px}
    span.c2 {color: #006400; font-size: 16px}
    span.c1 {color: #006400; font-size: 14px}
    /*]]>*/
    </style>
  </head>
  <body>
    <p>
      <strong><span class="c3"><span class="c1">This is a</span>
      <span class="c2">Test</span></span></strong>
    </p>
  </body>
</html>

Function Used

function cleaning($string, $tidyConfig = null) {
    $out = array ();
    $config = array (
            'indent' => true,
            'show-body-only' => false,
            'clean' => true,
            'output-xhtml' => true,
            'preserve-entities' => true 
    );
    if ($tidyConfig == null) {
        $tidyConfig = &$config;
    }
    $tidy = new tidy ();
    $out ['full'] = $tidy->repairString ( $string, $tidyConfig, 'UTF8' );
    unset ( $tidy );
    unset ( $tidyConfig );
    $out ['body'] = preg_replace ( "/.*<body[^>]*>|<\/body>.*/si", "", $out ['full'] );
    $out ['style'] = '<style type="text/css">' . preg_replace ( "/.*<style[^>]*>|<\/style>.*/si", "", $out ['full'] ) . '</style>';
    return ($out);
}

================================================

Edit 1 : Dirty Hack (Not Recommended)

================================================

Based on your last comment, it seems you want to retain the deprecated style. HTML Tidy might not allow you to do that since it's deprecated, but you can do this:

$out = cleaning ( $code );
$getStyle = new css2string ();
$getStyle->parseStr ( $out ['style'] );
$body = $out ['body'];
$search = array ();
$replace = array ();

foreach ( $getStyle->css as $key => $value ) {
    list ( $selector, $name ) = explode ( ".", $key );
    $search [] = "<$selector class=\"$name\">";
    $style = array ();
    foreach ( $value as $type => $att ) {
        $style [] = "$type:$att";
    }
    $replace [] = "<$selector style=\"" . implode ( ";", $style ) . ";\">";
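}

// Apply the substitutions: swap each class attribute for its inline style
$body = str_replace ( $search, $replace, $body );
print ($body);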

Output

<p>
  <strong>
      <span style="font-size:14px;">
        <span style="color:#006400;font-size:14px;">This is a</span>
        <span style="color:#006400;font-size:16px;">Test</span>
        </span>
  </strong>
</p>

Class Used

//Credit : https://mcmap.net/q/428176/-php-css-parser-selector-declarations-to-string
class css2string {
var $css;

function parseStr($string) {
    preg_match_all ( '/(?ims)([a-z0-9, \s\.\:#_\-@]+)\{([^\}]*)\}/', $string, $arr );
    $this->css = array ();
    foreach ( $arr [0] as $i => $x ) {
        $selector = trim ( $arr [1] [$i] );
        $rules = explode ( ';', trim ( $arr [2] [$i] ) );
        $this->css [$selector] = array ();
        foreach ( $rules as $strRule ) {
            if (! empty ( $strRule )) {
                $rule = explode ( ":", $strRule );
                $this->css [$selector] [trim ( $rule [0] )] = trim ( $rule [1] );
            }
        }
    }
}

function arrayImplode($glue, $separator, $array) {
    if (! is_array ( $array ))
        return $array;
    $styleString = array ();
    foreach ( $array as $key => $val ) {
        if (is_array ( $val ))
            $val = implode ( ',', $val );
        $styleString [] = "{$key}{$glue}{$val}";

    }
    return implode ( $separator, $styleString );
}

function getSelector($selectorName) {
    return $this->arrayImplode ( ":", ";", $this->css [$selectorName] );
}

}

Janinejanis answered 25/4, 2012 at 1:28 Comment(3)

Great effort. Thank you. I have already started writing my own clean-up code in PHP (using "Simple HTML DOM Parser"), but it's taking too much time! I will try your solution right now. – The 27/4, 2012 at 10:14

This is simpler and faster than you think: all you need to do is adjust the HTML Tidy config. It works not only for span, p and div, but for all HTML tags and CSS properties. – Janinejanis 27/4, 2012 at 10:59

<s>what is the key option in your config ? i tried with --clean but it does not eliminate redundant  tags for me.</s> forget my stupidity. it was -c or --clean yes. – Manama 18/12, 2014 at 3:39

Here is a solution that uses the browser to get the nested elements' properties. There is no need to cascade the properties up yourself, since the computed CSS styles can be read directly from the browser.

Here is an example: http://jsfiddle.net/mmeah/fUpe8/3/

var fixedCode = readNestProp($("#redo"));
$("#simp").html( fixedCode );

function readNestProp(el){
 var output = "";
 $(el).children().each( function(){
    if($(this).children().length==0){
        var _that=this;
        var _cssAttributeNames = ["font-size","color"];
        var _tag = $(_that).prop("nodeName").toLowerCase();
        var _text = $(_that).text();
        var _style = "";
        $.each(_cssAttributeNames, function(_index,_value){
            var css_value = $(_that).css(_value);
            if(typeof css_value!= "undefined"){
                _style += _value + ":";
                _style += css_value + ";";
            }
        });
        output += "<"+_tag+" style='"+_style+"'>"+_text+"</"+_tag+">";
    }else if(
        $(this).prop("nodeName").toLowerCase() !=
        $(this).find(">:first-child").prop("nodeName").toLowerCase()
    ){
        var _tag = $(this).prop("nodeName").toLowerCase();
        output += "<"+_tag+">" + readNestProp(this) + "</"+_tag+">";
    }else{
        output += readNestProp(this);
    };
 });
 return output;
}

A better solution than typing in all possible CSS attributes, like:
var _cssAttributeNames = ["font-size","color"];
is to use an approach like the one mentioned here: Can jQuery get all CSS styles associated with an element?

Seward answered 25/4, 2012 at 6:14 Comment(2)

Updated version with detection of parent elements that are different than the nested child: jsfiddle.net/pLkwD/7 – Seward 25/4, 2012 at 6:33

Updated version with detection of parent elements that are different than the nested child and output of result into a textarea: jsfiddle.net/mmeah/fUpe8/1 – Seward 25/4, 2012 at 6:53

You should look into HTMLPurifier. It's a great tool for parsing HTML and removing unnecessary and unsafe content from it. Look into the config options for removing empty spans and similar clean-up. It can be a bit of a beast to configure, I admit, but that's only because it's so versatile.

It's also quite heavy, so you'd want to save its output to the database (as opposed to reading the raw HTML from the database and then parsing it with the purifier every time).
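To illustrate the caching advice, here is a minimal sketch (in JavaScript for brevity) of cleaning each comment once when it is saved and storing the result. `cleanHtml` is a hypothetical stand-in for whatever purifier you use, not an HTMLPurifier API, and the `Map` stands in for the database table.

```javascript
// Hypothetical stand-in for an expensive purifier call (NOT an HTMLPurifier API).
let cleanCalls = 0;
function cleanHtml(rawHtml) {
  cleanCalls += 1;
  // Placeholder "cleaning": drop empty spans. A real purifier does far more.
  return rawHtml.replace(/<span[^>]*><\/span>/g, "");
}

// Stand-in for the database: comment id -> { raw, clean }.
const commentStore = new Map();

// Clean once, at save time, and store both versions.
function saveComment(id, rawHtml) {
  commentStore.set(id, { raw: rawHtml, clean: cleanHtml(rawHtml) });
}

// Viewing a comment never invokes the purifier again.
function viewComment(id) {
  const row = commentStore.get(id);
  return row ? row.clean : null;
}
```

Keeping the raw version alongside the cleaned one means you can re-run the clean-up later if the configuration changes.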

Hypoderm answered 20/4, 2012 at 14:30 Comment(2)

Thanks for the answer. I have tried HTMLPurifier before (using their online Demo). It does not remove redundant tags (such as test). Can this be enabled in the configuration? – The 20/4, 2012 at 14:45

Hmm looking at the docs now it may not cover your exact problem. Try looking at these settings: htmlpurifier.org/live/configdoc/… If that doesn't help you potentially need to use a smarter WYSIWYG editor – Hypoderm 20/4, 2012 at 15:52

I don't have time to finish this... maybe someone else can help. This javascript removes exact duplicate tags and disallowed tags too...

There are a few problems/things to be done,
1) regenerated tags need to be closed
2) it will only remove a tag if the tag name and attributes are identical to another within that node's children, so it's not 'smart' enough to remove all unnecessary tags.
3) it will look through the allowed CSS variables and extract ALL those values from an element, and then write it to the output HTML, so for example:

var allowed_css = ["color","font-size"];
<span style="font-size: 12px"><span style="color: #123123">

Will be translated into:

<span style="color:#000000;font-size:12px;"> <!-- inherited colour from parent -->
<span style="color:#123123;font-size:12px;"> <!-- inherited font-size from parent -->

Code:

<html>

<head>
<script type="text/javascript">
var allowed_css = ["font-size", "color"];
var allowed_tags = ["p","strong","span","br","b"];
function initialise() {
    var comment = document.getElementById("comment");
    var commentHTML = document.getElementById("commentHTML");
    var output = document.getElementById("output");
    var outputHTML = document.getElementById("outputHTML");
    print(commentHTML, comment.innerHTML, false);
    var out = getNodes(comment);
    print(output, out, true);
    print(outputHTML, out, false);
}
function print(out, stringCode, allowHTML) {
    out.innerHTML = allowHTML? stringCode : getHTMLCode(stringCode);
}
function getHTMLCode(stringCode) {
    return "<code>"+((stringCode).replace(/</g,"&lt;")).replace(/>/g,"&gt;")+"</code>";
}
function getNodes(elem) {
    var output = "";
    var nodesArr = new Array(elem.childNodes.length);
    for (var i=0; i<nodesArr.length; i++) {
        nodesArr[i] = new Array();
        nodesArr[i].push(elem.childNodes[i]);
        getChildNodes(elem.childNodes[i], nodesArr[i]);
        nodesArr[i] = removeDuplicates(nodesArr[i]);
        output += nodesArr[i].join("");
    }
    return output;
}
function removeDuplicates(arrayName) {
    var newArray = new Array();
    label:
    for (var i=0; i<arrayName.length; i++) {  
        for (var j=0; j<newArray.length; j++) {
            if(newArray[j]==arrayName[i])
                continue label;
        }
        newArray[newArray.length] = arrayName[i];
    }
    return newArray;
}
function getChildNodes(elemParent, nodesArr) {
    var children = elemParent.childNodes;
    for (var i=0; i<children.length; i++) {
        nodesArr.push(children[i]);
        if (children[i].hasChildNodes())
            getChildNodes(children[i], nodesArr);
    }
    return cleanHTML(nodesArr);
}
function cleanHTML(arr) {
    for (var i=0; i<arr.length; i++) {
        var elem = arr[i];
        if (elem.nodeType == 1) {
            if (tagNotAllowed(elem.nodeName)) {
                arr.splice(i,1);
                i--;
                continue;
            }
            elem = "<"+elem.nodeName+ getAttributes(elem) +">";
        }
        else if (elem.nodeType == 3) {
            elem = elem.nodeValue;
        }
        arr[i] = elem;
    }
    return arr;
}
function tagNotAllowed(tagName) {
    var allowed = " "+allowed_tags.join(" ").toUpperCase()+" ";
    if (allowed.search(" "+tagName.toUpperCase()+" ") == -1)
        return true;
    else
        return false;
}
function getAttributes(elem) {
    var attributes = "";
    for (var i=0; i<elem.attributes.length; i++) {
      var attrib = elem.attributes[i];
      if (attrib.specified == true) {
        if (attrib.name == "style") {
            attributes += " style=\""+getCSS(elem)+"\"";
        } else {
            attributes += " "+attrib.name+"=\""+attrib.value+"\"";
        }
      }
    }
    return attributes
}
function getCSS(elem) {
    var style="";
    if (elem.currentStyle) {
        for (var i=0; i<allowed_css.length; i++) {
            var styleProp = allowed_css[i];
            style += styleProp+":"+elem.currentStyle[styleProp]+";";
        }
    } else if (window.getComputedStyle) {
        for (var i=0; i<allowed_css.length; i++) {
            var styleProp = allowed_css[i];
            style += styleProp+":"+document.defaultView.getComputedStyle(elem,null).getPropertyValue(styleProp)+";";
        }
    }
    return style;
}
</script>
</head>

<body onload="initialise()">

<div style="float: left; width: 300px;">
<h2>Input</h2>
<div id="comment">
<p> 
 <strong> 
  <span style="font-size: 14px"> 
   <span style="color: #006400"> 
     <span style="font-size: 14px"> 
      <span style="font-size: 16px"> 
       <span style="color: #006400"> 
        <span style="font-size: 14px"> 
         <span style="font-size: 16px"> 
          <span style="color: #006400">This is a </span> 
         </span> 
        </span> 
       </span> 
      </span> 
     </span> 
    </span> 
    <span style="color: #006400"> 
     <span style="font-size: 16px"> 
      <span style="color: #b22222"><b>Test</b></span> 
     </span> 
    </span> 
   </span> 
  </span> 
 </strong> 
</p> 
<p>Second paragraph.
<span style="color: #006400">This is a span</span></p>
</div>
<h3>HTML code:</h3>
<div id="commentHTML"> </div>
</div>

<div style="float: left; width: 300px;">
<h2>Output</h2>
<div id="output"> </div>
<h3>HTML code:</h3>
<div id="outputHTML"> </div>
</div>

<div style="float: left; width: 300px;">
<h2>Tasks</h2>
<big>
<ul>
<li>Close Tags</li>
<li>Ignore inherited CSS style in method getCSS(elem)</li>
<li>Test with different input HTML</li>
</ul>
</big>
</div>

</body>

</html>
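
One possible approach to the "Ignore inherited CSS style" task listed above is to compare an element's computed values with its parent's and write out only the ones that differ. The sketch below works on plain objects so it can run outside a browser; in the page you would fill them from `getComputedStyle` for the element and its parent. `diffFromParent` is a hypothetical helper, not part of the code above.

```javascript
var allowed_css = ["font-size", "color"];

// Build a style string containing only the allowed properties whose value
// differs from the parent's value (i.e. the ones this element actually changes).
// Both arguments are plain objects mapping property name -> computed value.
function diffFromParent(elemStyle, parentStyle) {
  var style = "";
  for (var i = 0; i < allowed_css.length; i++) {
    var prop = allowed_css[i];
    if (elemStyle[prop] !== undefined && elemStyle[prop] !== parentStyle[prop]) {
      style += prop + ":" + elemStyle[prop] + ";";
    }
  }
  return style;
}

// Parent renders as 12px black; the child only changes the colour.
var parentStyle = { "font-size": "12px", "color": "rgb(0, 0, 0)" };
var childStyle = { "font-size": "12px", "color": "rgb(18, 49, 35)" };
console.log(diffFromParent(childStyle, parentStyle)); // color:rgb(18, 49, 35);
```

An element for which this returns an empty string changes nothing relative to its parent, which makes it a candidate for removal.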

Lashaunda answered 23/4, 2012 at 5:8 Comment(0)

It may not address your exact problem, but what I would have done in your place is simply eliminate all HTML tags completely and retain only plain text and line breaks.

After that is done, switch to Markdown or BBCode to format your comments better. A WYSIWYG editor is rarely useful.

The reason for that is that you said all you had in the comments is presentational data, which, frankly, isn't that important.
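
A minimal sketch of that stripping step, assuming the comments only contain the small tag set described in the question. It is regex-based and deliberately naive: it is not a general HTML parser or a sanitizer, and it leaves entities such as `&amp;` untouched. `htmlToPlainText` is a name made up for this example.

```javascript
// Reduce an HTML comment to plain text, keeping line breaks.
function htmlToPlainText(html) {
  return html
    .replace(/<br\s*\/?>/gi, "\n")   // <br> becomes a newline
    .replace(/<\/p\s*>/gi, "\n")     // end of paragraph becomes a newline
    .replace(/<[^>]+>/g, "")         // drop every remaining tag
    .replace(/[ \t]+/g, " ")         // collapse runs of spaces and tabs
    .replace(/ ?\n ?/g, "\n")        // trim spaces around newlines
    .trim();
}

var sample = '<p><strong><span style="color: #006400">This is a </span>' +
             '<span style="color: #b22222">Test</span></strong></p>' +
             '<p>Second<br>line</p>';
console.log(htmlToPlainText(sample));
// This is a Test
// Second
// line
```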

Florid answered 23/4, 2012 at 5:39 Comment(1)

I agree that using a WYSIWYG editor was a very bad idea. I am switching all the editors in my site to BBCode, but I need first to convert all the existing comments to BBCode (while maintaining their style/format). Thanks – The 27/4, 2012 at 10:12

Cleanup HTML collapses tags which seems to be what you are asking for. However, it creates a validated HTML document with CSS moved to inline styles. Many other HTML formatters won't do this because it changes the structure of the HTML document.

Amontillado answered 25/4, 2012 at 17:27 Comment(0)

I remember that Adobe (Macromedia) Dreamweaver, at least in slightly older versions, had a 'Clean up HTML' option, and also a 'Clean up Word HTML' option, to remove redundant tags etc. from any webpage.

Thedathedric answered 20/4, 2012 at 14:50 Comment(4)

That's nice. Not really an answer to the problem though. – Bear 20/4, 2012 at 14:53

Thanks, Manoj. I have actually tried that feature. It does 'blind' cleaning for HTML tags, but it can not clean something like test. – The 20/4, 2012 at 15:9

I tried test through Dreamweaver and it took out the extra . However, the original example is another story and didn't work. Since you have the same tag nested, but with different inline styles manually set on them, that complicates it. Maybe another approach is needed. Could you get the html without all those spans with a style attribute? You could also try HTML Tidy which also has a library you can use to help you tidy it the way you want. – Thedathedric 20/4, 2012 at 15:33

Oh .. perhaps I misremembered. As you mentioned, however, the actual case that I am facing is more complicated and cannot be handled by Dreamweaver. I will see if HTML Tidy can do something about it. Thanks again. – The 20/4, 2012 at 15:44

I know you're looking for an HTML DOM cleanser, but maybe js can help?

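function getSpans() {
    // getElementsByTagName returns a live collection, so walk it backwards
    // to avoid skipping elements as they are removed.
    var spans = document.getElementsByTagName('span');
    for (var i = spans.length - 1; i >= 0; i--) {
        var span = spans[i];
        // removeNode() is IE-only; unwrap the span with standard DOM calls,
        // keeping its children (the text) in place.
        while (span.firstChild) {
            span.parentNode.insertBefore(span.firstChild, span);
        }
        span.parentNode.removeChild(span);
    }
    // add the styling you want here
}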

Wittgenstein answered 20/4, 2012 at 14:55 Comment(0)

Rather than waste your precious server time parsing bad HTML I would suggest you fix the root of the problem instead.

A simple solution would be to limit the number of characters each commenter can submit, counting the entire HTML rather than just the visible text (at least that would stop infinitely large nested tags).
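
As a rough illustration of counting the whole HTML rather than only the text, assuming the check runs on the raw string the editor submits (`isWithinHtmlLimit` and the limit of 500 are made up for this example):

```javascript
// Reject a comment when its raw HTML, not just its visible text, is too long.
// maxHtmlChars is an arbitrary illustrative limit.
function isWithinHtmlLimit(rawHtml, maxHtmlChars) {
  return rawHtml.length <= maxHtmlChars;
}

var visibleText = "Test";
// 50 nested spans around four characters of text.
var bloated = '<span style="font-size: 14px">'.repeat(50) + visibleText + "</span>".repeat(50);

console.log(isWithinHtmlLimit(visibleText, 500)); // true
console.log(isWithinHtmlLimit(bloated, 500));     // false
```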

You could improve on that by allowing the user to switch between HTML-view and text-view - I'm sure most people would see a load of junk when in the HTML view and simply CTRL+A & DEL it.

I think it would be best if you had your own formatting characters that you parse and replace with the formatting, like Stack Overflow's **bold text**, visible to the poster. Or plain BBCode would do, also visible to the poster.

Lashaunda answered 22/4, 2012 at 17:26 Comment(2)

I completely agree. Using an HTML wysiwyg editor was clearly a big mistake. I am working on replacing it with BBCode editor with a small subset of formatting-tags. However, I need a way of cleaning-up and fixing the existing comments without deleting/destroying/ruining them. – The 22/4, 2012 at 17:29

@The Just remove all of the span tags and preserve the plain-text & line breaks... I'm sure it won't be that much of a loss. Anyway, it's not really worth developing something like that just for your old comments. – Lashaunda 22/4, 2012 at 17:58

Try not to parse the HTML with DOM but maybe with SAX (http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm)

SAX parses a document from the beginning and sends events like 'start of element' and 'end of element' to the callback functions you define.

Then you can build a kind of stack for all events. Whenever you reach text, you can save the effect of the current stack on that text.

After that you process the stack to build up new HTML with only the effect you want.
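
The stack idea described above can be sketched as follows. This is only an illustration under stated assumptions: the events are supplied by hand in the shape a SAX-style parser might report them, each "start" event carries the style its tag sets, and the names `flattenEvents`, `sameStyle` and `renderRuns` are made up for this example. Tags without a style (such as `<strong>`) would have to be mapped to a property by the caller, and whitespace-only text between tags is dropped.

```javascript
// Walk the events, keeping a stack of the style in effect at each nesting level.
function flattenEvents(events) {
  var stack = [{}];   // effective style per nesting level
  var runs = [];      // [{ text, style }]
  events.forEach(function (ev) {
    if (ev.type === "start") {
      // Inner declarations override outer ones for the same property.
      stack.push(Object.assign({}, stack[stack.length - 1], ev.style));
    } else if (ev.type === "end") {
      stack.pop();
    } else if (ev.type === "text" && ev.text.trim() !== "") {
      var style = stack[stack.length - 1];
      var last = runs[runs.length - 1];
      if (last && sameStyle(last.style, style)) {
        last.text += ev.text;   // merge adjacent runs with identical formatting
      } else {
        runs.push({ text: ev.text, style: style });
      }
    }
  });
  return runs;
}

function sameStyle(a, b) {
  var ka = Object.keys(a), kb = Object.keys(b);
  return ka.length === kb.length && ka.every(function (k) { return a[k] === b[k]; });
}

// Emit one span per run, carrying only the properties that are in effect.
function renderRuns(runs) {
  return runs.map(function (run) {
    var css = Object.keys(run.style).sort().map(function (k) {
      return k + ":" + run.style[k];
    }).join(";");
    return css ? '<span style="' + css + '">' + run.text + "</span>" : run.text;
  }).join("");
}

var events = [
  { type: "start", style: { "font-size": "14px" } },
  { type: "start", style: { "color": "#006400" } },
  { type: "start", style: { "font-size": "16px" } },
  { type: "start", style: { "font-size": "14px" } },
  { type: "text", text: "This is a " },
  { type: "end" }, { type: "end" },
  { type: "start", style: { "color": "#b22222" } },
  { type: "text", text: "Test" },
  { type: "end" }, { type: "end" }, { type: "end" }
];
console.log(renderRuns(flattenEvents(events)));
// <span style="color:#006400;font-size:14px">This is a </span><span style="color:#b22222;font-size:14px">Test</span>
```

Because each text run records only the final value of every property, overridden outer tags disappear without any explicit "is this tag redundant?" check.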

Frisbee answered 23/4, 2012 at 17:27 Comment(2)

Thanks. DOM provides a tree-like structure that is easier to deal with when it comes to nested tags, which is very important in order to determine which styles have been overridden by other tags/styles. – The 27/4, 2012 at 10:10

You can of course reach your goal both ways, but I think it would take more resources to climb up the DOM tree for every single element. – Frisbee 29/4, 2012 at 21:16

If you want to use jQuery, try this:

<p>
<strong>
  <span style="font-size: 14px">
   <span style="color: #006400">
     <span style="font-size: 14px">
      <span style="font-size: 16px">
       <span style="color: #006400">
        <span style="font-size: 14px">
         <span style="font-size: 16px">
          <span style="color: #006400">This is a </span>
         </span>
        </span>
       </span>
      </span>
     </span>
    </span>
    <span style="color: #006400">
     <span style="font-size: 16px">
      <span style="color: #b22222">Test</span>
     </span>
    </span>
   </span>
  </span>
 </strong>
</p>
<br><br>
<div id="out"></div> <!-- Just to print it out -->


$("span").each(function(i){
    var ntext = $(this).text();
    ntext = $.trim(ntext.replace(/(\r\n|\n|\r)/gm," "));
    if(i==0){
        $("#out").text(ntext);
    }        
});

You get this as a result:

<div id="out">This is a                                                                    Test</div>

You could then format it any way you want. Hope that helps you think a little differently about it...

Alkyd answered 25/4, 2012 at 17:13 Comment(1)

Thank you. My objective is to keep the formatting of the text unchanged. I just want to remove all the extra formatting tags that are not effective because they are overridden. Your solution extracts the strings without any formatting. – The 27/4, 2012 at 10:7
