PHP DOMDocument adds extra tags
I'm trying to parse a document, find all the image tags, and change each one's src to something different.

$domDocument = new DOMDocument();

$domDocument->loadHTML($text);

$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
  // Changing the attribute is enough; the document is serialized once, after the loop.
  $Image->setAttribute('src', 'lalala');
}

$text = $domDocument->saveHTML();

The $text initially looks like this:

<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>

and this is the output $text:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p></body></html>

I'm getting a bunch of extra tags (html, body, and the DOCTYPE declaration at the top) that I don't really need. Is there any way to set up the DOMDocument so it doesn't add them?

Lentigo answered 26/1, 2011 at 0:45 Comment(0)
C
4

DOMDocument unfortunately won't let you do this directly. Try stripping the extra tags from the output afterwards:

$text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $domDocument->saveHTML()));
Cyma answered 26/1, 2011 at 1:39 Comment(2)
it should read: $text = preg_replace('/^<!DOCTYPE.+?>/', '', str_replace( array('<html>', '</html>', '<body>', '</body>'), array('', '', '', ''), $domDocument->saveHTML())); – Phototransistor
preg_replace, really? – Hillegass
O
21

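You just need to pass two libxml flags as the second argument to loadHTML(): LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD. The first stops the parser from adding the implied html and body elements, and the second stops it from adding a default DOCTYPE when the input has none. The options argument requires PHP 5.4 or later. I.e.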

$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);

See IDEONE demo:

$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';
$domDocument = new DOMDocument;
$domDocument->loadHTML($text, LIBXML_HTML_NOIMPLIED|LIBXML_HTML_NODEFDTD);
$imageNodeList = $domDocument->getElementsByTagName('img');

foreach ($imageNodeList as $Image) {
    // Changing the attribute is enough; the document is serialized once, after the loop.
    $Image->setAttribute('src', 'lalala');
}

$text = $domDocument->saveHTML();
echo $text;

Output:

<p>Hi, this is a test, here is an image<img src="lalala" width="60" height="95"> Because I like Beer!</p>
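
One caveat, as far as I understand libxml's behaviour: with LIBXML_HTML_NOIMPLIED the parser treats the first element as the document root, so a fragment with several top-level elements can come out nested or otherwise rearranged. A common workaround is to wrap the fragment in a temporary element and serialize only its children. The sketch below assumes that approach; the sample $text and the variable names $image and $result are made up for illustration.

```php
<?php
// Hypothetical sample with two top-level paragraphs.
$text = '<p>First <img src="http://example.com/beer.jpg" /></p><p>Second</p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$domDocument = new DOMDocument;
// The temporary <div> gives the parser a single root element to work with.
$domDocument->loadHTML(
    '<div>' . $text . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

foreach ($domDocument->getElementsByTagName('img') as $image) {
    $image->setAttribute('src', 'lalala');
}

// Serialize each child of the wrapper, so the wrapper itself is left out.
$result = '';
foreach ($domDocument->documentElement->childNodes as $child) {
    $result .= $domDocument->saveHTML($child);
}

echo $result;
```

Passing a node to saveHTML() needs PHP 5.3.6 or later.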
Orbital answered 15/7, 2015 at 9:22 Comment(3)
For me that just strips all html out of there. My paragraphs are gone too. – Saberhagen
@Mike: That is impossible as the code does not remove anything. Maybe the HTML you have is not fully valid. Try adding libxml_use_internal_errors(true); before initializing the DOMDocument with $domDocument = new DOMDocument;. – Hulk
@WiktorStribiżew I was using it to strip the Script tags out of a text field as per here: #7131367 – Saberhagen
L
1

If you are up for a hack, this is how I managed to get around this annoyance: load the string as XML and save it as HTML. :)
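
A minimal sketch of what that might look like, assuming the fragment is well-formed XML with a single root element (loadXML() will reject anything else, including HTML-only entities such as &nbsp;). The variable names $doc, $img and $html are mine, not from the question.

```php
<?php
$text = '<p>Hi, this is a test, here is an image<img src="http://example.com/beer.jpg" width="60" height="95" /> Because I like Beer!</p>';

$doc = new DOMDocument();
// Parse as XML: no html/body wrapper or DOCTYPE is added to the tree.
$doc->loadXML($text);

foreach ($doc->getElementsByTagName('img') as $img) {
    $img->setAttribute('src', 'lalala');
}

// Serialize with the HTML serializer rather than saveXML().
$html = $doc->saveHTML();

echo $html;
```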

Leicester answered 26/1, 2011 at 0:59 Comment(0)
E
0

You can use SmartDOMDocument (http://beerpla.net/projects/smartdomdocument-a-smarter-php-domdocument-class/). Quoting its description:

DOMDocument has an extremely badly designed "feature" where if the HTML code you are loading does not contain <html> and <body> tags, it adds them automatically (yup, there are no flags to turn this behavior off).

Thus, when you call $doc->saveHTML(), your newly saved content now has <html>, <body> and DOCTYPE in it. Not very handy when trying to work with code fragments (XML has a similar problem).

SmartDOMDocument contains a new function called saveHTMLExact() which does exactly what you would want – it saves HTML without adding that extra garbage that DOMDocument does.

Extine answered 12/1, 2015 at 23:9 Comment(0)
W
-2

If you're going to save as HTML, you have to expect a valid HTML document to be created!

There is another option: DOMDocument::saveXML has an optional parameter allowing you to access the XML content of a particular element:

$el = $domDocument->getElementsByTagName('p')->item(0);
$text = $domDocument->saveXML($el);

This presumes that your content only has one p element.
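
If the content has more than one top-level element, one possible variation (a sketch under my own assumptions, not part of the suggestion above) is to let loadHTML() add its wrapper and then serialize only the children of body with saveHTML($node), which needs PHP 5.3.6 or later. Because it uses the HTML serializer, empty elements are not written in XML shorthand. The sample $text and the names $body and $fragment are made up for illustration.

```php
<?php
// Hypothetical sample with two top-level paragraphs.
$text = '<p>First paragraph</p><p>Second <img src="http://example.com/beer.jpg" /></p>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$domDocument = new DOMDocument();
$domDocument->loadHTML($text); // html/body/DOCTYPE are added here as usual

$body = $domDocument->getElementsByTagName('body')->item(0);

// Concatenate the HTML of each child of <body>, skipping the wrapper tags.
$fragment = '';
foreach ($body->childNodes as $child) {
    $fragment .= $domDocument->saveHTML($child);
}

echo $fragment;
```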

Watchtower answered 26/1, 2011 at 0:51 Comment(1)
Depending on the elements used in the document, it's not always a good idea to use saveXML() to retrieve HTML source. The generated XML uses the self-closing shorthand for every element without content, which will break the HTML document (e.g. <script src="some.js"/>). You'll need to parse the result and correct it, or transform it using XSLT, to get a valid HTML document. – Burlburlap
