I have been using CKEditor wysiwyg editor for a website where users are allowed to use the HTML editor to add some comments. I ended up having some extremely redundant nested HTML code in my database that is slowing down the viewing/editing of these comments.
I have comments that look like this (this is a very small example. I have comments with over 100 nested tags):
<p>
<strong>
<span style="font-size: 14px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">
<span style="font-size: 14px">
<span style="font-size: 16px">
<span style="color: #006400">This is a </span>
</span>
</span>
</span>
</span>
</span>
</span>
<span style="color: #006400">
<span style="font-size: 16px">
<span style="color: #b22222">Test</span>
</span>
</span>
</span>
</span>
</strong>
</p>
My questions are:
Is there any library/code/software that can do a smart (i.e. format-aware) clean-up of the HTML code, removing all redundant tags that have no effect on the formatting (because they're overridden by inner tags) ? I've tried many existing online solutions (such as HTML Tidy). None of them do what I want.
If not, I'll need to write some code for HTML parsing and cleaning. I am planning to use PHP Simple HTML DOM to traverse the HTML tree and find all tags that have no effect. Do you suggest any other HTML parser that is more suitable for my purpose?
Thanks
.
Update:
I have written some code to analyze the HTML code that I have. All the HTML tags that I have are:
<span>
with styles forfont-size
and/orcolor
<font>
with attributescolor
and/orsize
<a>
for links (withhref
)<strong>
<p>
(single tag to wrap the whole comment)<u>
I can easily write some code to convert the HTML code into bbcode (e.g. [b]
, [color=blue]
, [size=3]
, etc). So I above HTML will become something like:
[b][size=14][color=#006400][size=14][size=16][color=#006400]
[size=14][size=16][color=#006400]This is a [/color][/size]
[/size][/color][/size][/size][color=#006400][size=16]
[color=#b22222]Test[/color][/size][/color][/color][/size][/b]
The question now is: Is there an easy way (algorithm/library/etc) to clean-up the messy (as messy as that original HTML) bbcode that will be generated?
thanks again