PHP DOMDocument adds extra tags

5

9

I'm trying to parse a document, get all the image tags, and change each src to something different.

$domDocument = new DOMDocument();

$domDocument->loadHTML($text);

$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
  $Image->setAttribute('src', 'lalala');
  $domDocument->saveHTML($Image);
}

$text = $domDocument->saveHTML();

The $text initially looks like this:

<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>

and this is the output $text:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p></body></html>

I'm getting a bunch of extra tags (html, body, and the DOCTYPE declaration at the top) that I don't really need. Is there any way to set up the DOMDocument so that it doesn't add them?

Lentigo answered 26/1, 2011 at 0:45 Comment(0)

4

DOMDocument unfortunately won't let you do this directly. Try stripping the extra tags afterwards:

$text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $domDocument->saveHTML()));

Cyma answered 26/1, 2011 at 1:39 Comment(2)

it should read: $text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $domDocument->saveHTML())); – Phototransistor 14/5, 2011 at 6:58

preg_replace, really? – Hillegass 15/9, 2017 at 16:52

21

You just need to pass two flags to loadHTML() as its second argument: LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD (available since PHP 5.4 with libxml 2.7.8 or later). For example:

$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);

See IDEONE demo:

$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';
$domDocument = new DOMDocument;
// NOIMPLIED: don't add implied <html>/<body>; NODEFDTD: don't add a default DOCTYPE.
$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
    // Setting the attribute is enough; a per-node saveHTML() call whose result is discarded does nothing.
    $Image->setAttribute('src', 'lalala');
}

// Serialize the whole document once at the end.
$text = $domDocument->saveHTML();
echo $text;

Output:

<p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p>

Orbital answered 15/7, 2015 at 9:22 Comment(3)

For me that just strips all html out of there. My paragraphs are gone too. – Saberhagen 17/6, 2016 at 20:7

@Mike: That is impossible as the code does not remove anything. Maybe the HTML you have is not fully valid. Try adding libxml_use_internal_errors(true); before initializing the DOMDocument with $domDocument = new DOMDocument;. – Hulk 17/6, 2016 at 20:10
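
A minimal sketch of that suggestion, assuming a made-up, deliberately sloppy fragment in $html: libxml_use_internal_errors(true) makes libxml collect parse problems instead of raising PHP warnings, so they can be inspected afterwards. Which messages show up, if any, depends on the libxml2 version.

```php
<?php
// Hypothetical input: an unclosed <b> and a made-up <foo> tag.
$html = '<p>Some <b>sloppy markup</p><foo>unknown tag</foo>';

// Collect libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$domDocument = new DOMDocument;
$domDocument->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// Inspect whatever the parser complained about (the list may be empty).
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), PHP_EOL;
}

// Reset the error buffer and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

$result = $domDocument->saveHTML();
echo $result;
```

Comparing $result with the input is a quick way to see how the parser restructured the markup.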

@WiktorStribiżew I was using it to strip the Script tags out of a text field as per here: #7131367 – Saberhagen 20/6, 2016 at 18:15

1

If you are up for a hack, this is how I managed to get around this annoyance: load the string as XML and save it as HTML. :)
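
A rough sketch of that hack, using the fragment from the question. It assumes the input is well-formed XML with a single root element; loadXML() will reject typical tag-soup HTML and HTML-only entities such as &nbsp;.

```php
<?php
$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';

$domDocument = new DOMDocument;

// Parse as XML: no <html>/<body> wrapper or DOCTYPE gets added.
if (!$domDocument->loadXML($text)) {
    throw new RuntimeException('Input is not well-formed XML');
}

foreach ($domDocument->getElementsByTagName('img') as $image) {
    $image->setAttribute('src', 'lalala');
}

// Serialize with HTML rules (e.g. <img> without a self-closing slash).
$text = trim($domDocument->saveHTML());
echo $text;
```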

Leicester answered 26/1, 2011 at 0:59 Comment(0)

0

You can use SmartDOMDocument (http://beerpla.net/projects/smartdomdocument-a-smarter-php-domdocument-class/), whose description reads:

DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).

Thus, when you call $doc->saveHTML(), your newly saved content now has <html>, <body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.

Extine answered 12/1, 2015 at 23:9 Comment(0)

-2

If you're going to save as HTML, you have to expect a valid HTML document to be created!

There is another option: DOMDocument::saveXML has an optional parameter allowing you to access the XML content of a particular element:

$el = $domDocument->getElementsByTagName('p')->item(0);
$text = $domDocument->saveXML($el);

This presumes that your content only has one p element.
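
To drop the assumption of a single p element, and to avoid XML-style serialization, one possible variant is to pass each top-level node to saveHTML() instead (the node argument exists since PHP 5.3.6). The sketch below lets DOMDocument add its wrapper and then serializes only the children of body; the helper name innerBodyHtml and the sample markup are made up for illustration.

```php
<?php
// Hypothetical helper: returns the markup inside <body>, without the wrapper.
function innerBodyHtml(DOMDocument $doc): string
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML($node) serializes just that node using HTML rules.
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

$domDocument = new DOMDocument;
$domDocument->loadHTML('<p>First <img src="a.jpg" /></p><p>Second</p>');

foreach ($domDocument->getElementsByTagName('img') as $image) {
    $image->setAttribute('src', 'lalala');
}

$text = innerBodyHtml($domDocument);
echo $text;
```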

Watchtower answered 26/1, 2011 at 0:51 Comment(1)

Depending on the elements used in the document, it's not always a good idea to use saveXML() to retrieve HTML source. The generated XML uses the self-closing shorthand for all elements without content, which will break the HTML document (e.g. <script src="some.js"/>). You'll need to parse the result and correct it, or transform it using XSLT, to get a valid HTML document. – Burlburlap 26/1, 2011 at 1:18
