Loop over DOMDocument
G

5

27

I am following the suggestion from the question "Robust, Mature HTML Parser for PHP" about using DOMDocument to parse HTML that may be malformed.

Is there an easy way to loop over the parsed document? I would like to loop over HTML like this:

$html='<ul>
         <li>value1</li>
         <li>value1</li>
         <li>value3
            <p>subvalue</p>
         </li>
        </ul>
        <p>hello world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
???
foreach (??? as $node)
{
  print $node->nodeName.':'.$node->nodeValue;
}

and get results something like this:

 ul:
 li:value1
 li:value2
 li:value3
 p:subvalue
 p:hello world

Using $doc->childNodes by itself doesn't really do what I want, since it doesn't seem to descend into the lower branches of the tree. I used the code suggested by halfdan and got results like this:

html:
html:value1
         value1
         value3
            subvalue

        hello world
Grader answered 26/5, 2010 at 2:51 Comment(1)
DOM objects can (but don't always) have a property called $childNodes that you can iterate over. You can check whether a node has children with the hasChildNodes() method. – Firkin
H
45

Try this:

$doc = new DOMDocument();
$doc->loadHTML($html);
showDOMNode($doc);

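// Recursively print every node below $domNode, depth first.
function showDOMNode(DOMNode $domNode) {
    foreach ($domNode->childNodes as $node) {
        // For an element, nodeValue is the text of all its descendants.
        print $node->nodeName . ':' . $node->nodeValue . PHP_EOL;
        if ($node->hasChildNodes()) {
            showDOMNode($node);
        }
    }
}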
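
The recursive walk above visits text nodes as well as elements, and an element's nodeValue contains the text of all of its descendants, so the output repeats values. A minimal sketch of one way to get closer to the output requested in the question is to report element nodes only, each with its own direct text. The helper names ownText() and collectElements() are hypothetical, and the sample HTML is a compacted variant of the question's input:

```php
<?php
// Hypothetical helper: concatenate only the direct text children of a node.
function ownText(DOMNode $node): string
{
    $text = '';
    foreach ($node->childNodes as $child) {
        // Nested elements are reported separately, so skip them here.
        if ($child->nodeType === XML_TEXT_NODE) {
            $text .= $child->nodeValue;
        }
    }
    return trim($text);
}

// Hypothetical helper: depth-first walk that records element nodes only.
function collectElements(DOMNode $node, array &$lines): void
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            $lines[] = $child->nodeName . ':' . ownText($child);
            collectElements($child, $lines);
        }
    }
}

$html = '<ul><li>value1</li><li>value2</li><li>value3<p>subvalue</p></li></ul>'
      . '<p>hello world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// loadHTML() wraps a fragment in <html><body>, so start the walk at <body>.
$body = $doc->getElementsByTagName('body')->item(0);

$lines = [];
collectElements($body, $lines);
echo implode(PHP_EOL, $lines), PHP_EOL;
```

Starting from the body element keeps the html and body wrappers that loadHTML() adds out of the output.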
Hypersonic answered 26/5, 2010 at 2:59 Comment(1)
Thanks, I have updated my question to be more clear. I don't believe $doc->childNodes by itself does what I want. Basically I want to visit each node in the tree, not just see all nodes at one level. – Grader
S
2

I was having issues with elements that contained CDATA: even elements that had no child elements were reporting that they had children.

I am not sure why that was.

The workaround I found was to change

if ($node->hasChildNodes()) {
    showDOMNode($node);
}

to

if ($node->childNodes->length != 1) {
    showDOMNode($node);
}

And the code now works perfectly.
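
A likely explanation for the behaviour described above is that text and CDATA sections are themselves child nodes in the DOM, so hasChildNodes() returns true for an element that contains only text. The sketch below illustrates this and shows one alternative check that looks for element children specifically. The helper name hasElementChildren() and the sample XML are assumptions made for the illustration:

```php
<?php
// Hypothetical helper: true only if the node has at least one element child.
function hasElementChildren(DOMNode $node): bool
{
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_ELEMENT_NODE) {
            return true;
        }
    }
    return false;
}

$doc = new DOMDocument();
$doc->loadXML(
    '<root><leaf>text</leaf><cdata><![CDATA[raw]]></cdata><branch><leaf/></branch></root>'
);

foreach ($doc->documentElement->childNodes as $node) {
    // The text and CDATA children make hasChildNodes() true for leaf and cdata.
    printf(
        "%s: hasChildNodes=%s, hasElementChildren=%s\n",
        $node->nodeName,
        var_export($node->hasChildNodes(), true),
        var_export(hasElementChildren($node), true)
    );
}
```

Unlike comparing childNodes->length to 1, this check still recurses into an element that has exactly one child element.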

Septimal answered 23/10, 2012 at 6:5 Comment(0)
R
2

One way is to walk the tree as follows:

function next_node($node)
{
    if($node->firstChild != null)
    {
        return $node->firstChild;
    }

    if($node->nextSibling != null)
    {
        return $node->nextSibling;
    }

    for($node = $node->parentNode; $node != null; $node = $node->parentNode)
    {
        if($node->nextSibling != null)
        {
            return $node->nextSibling;
        }
    }

    return null;
}

for($node = $doc; $node != null; $node = next_node($node))
{
    // Handle the node here. This walk is read-only: if you need to
    // modify the tree, first save all the nodes in an array and then
    // loop over that array.
    //
    ...
}

This works for most documents; however, it looks like parentNode is sometimes not set correctly, and next_node() then returns the wrong node.
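
The comment in the loop above suggests saving the nodes in an array before changing the tree. A minimal sketch of that two-pass approach follows; next_node() is repeated (with strict comparisons) so the example runs on its own, and the sample XML and the choice to remove every b element are assumptions made for the illustration:

```php
<?php
// Same pre-order walk as above: child first, then sibling, then climb.
function next_node($node)
{
    if ($node->firstChild !== null) {
        return $node->firstChild;
    }
    if ($node->nextSibling !== null) {
        return $node->nextSibling;
    }
    for ($node = $node->parentNode; $node !== null; $node = $node->parentNode) {
        if ($node->nextSibling !== null) {
            return $node->nextSibling;
        }
    }
    return null;
}

$doc = new DOMDocument();
$doc->loadXML('<root><a><b/></a><c/><d><b/></d></root>');

// Pass 1: read-only walk that takes a snapshot of every node.
$nodes = [];
for ($node = $doc; $node !== null; $node = next_node($node)) {
    $nodes[] = $node;
}

// Pass 2: modify the tree while iterating over the snapshot, so removing
// a node cannot disturb the firstChild/nextSibling/parentNode links that
// the walk depends on.
foreach ($nodes as $node) {
    if ($node->nodeName === 'b') {
        $node->parentNode->removeChild($node);
    }
}

echo $doc->saveXML($doc->documentElement), PHP_EOL;
```

Removing or moving nodes during the walk itself detaches them from their parent and siblings, which is one way a walk like this can end up at the wrong node.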

Roping answered 4/9, 2017 at 9:28 Comment(0)
M
1

You can use the PHP Simple HTML DOM Parser with the following code:

<?php
require_once 'simplehtmldom/simple_html_dom.php';

function iterateHtmlElements($html)
{
    $dom = str_get_html($html);
    $dom->set_callback('handleElement');
    $dom->__toString();
    echo "\n";
}

function handleElement(simple_html_dom_node $elem)
{
    if($elem->tag == 'text') {
        echo $elem->innertext();
    }
    else {
        echo "\n" . $elem->tag . ": ";
    }
}

$html='<ul>
         <li>value1</li>
         <li>value1</li>
         <li>value3
            <p>subvalue</p>
         </li>
        </ul>
        <p>hello world</p>';
iterateHtmlElements($html);

It works exactly as expected. I checked it with the input you provided and got the following results:

> php test2.php

ul:
li: value1
li: value1
li: value3
p: subvalue
p: hello world
Melvinmelvina answered 17/11, 2013 at 1:6 Comment(2)
Does not work for me, str_get_html just returns false. – Larrisa
Ended up using https://mcmap.net/q/505985/-what-is-the-best-php-dom-2-array-function – Larrisa
R
0

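If you only need to loop over the elements with a particular tag name, you can select them directly:

// $a holds the markup to parse. loadXML() requires well-formed XML;
// for HTML that may be malformed, use loadHTML() instead.
$doc = new DOMDocument;
$doc->loadXML($a);
$nodes = $doc->getElementsByTagName("tr");
$xml = "";
foreach ($nodes as $node) {
    // here you can extract the content of individual <td> tags
    $xml .= $doc->saveXML($node);
}
var_dump(htmlentities($xml));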
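
The same method can also visit every element in the document: getElementsByTagName() accepts the special value "*", which matches all tags and returns them in document order. The sketch below assumes a small HTML sample, and the variable names are illustrative only:

```php
<?php
$html = '<ul><li>value1</li><li>value2</li></ul><p>hello world</p>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// "*" matches every element, including the <html> and <body>
// wrappers that loadHTML() adds around a fragment.
$names = [];
foreach ($doc->getElementsByTagName('*') as $element) {
    $names[] = $element->nodeName;
}

echo implode(' ', $names), PHP_EOL;
```

This avoids writing the recursion by hand, at the cost of losing the nesting depth of each element.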
Radburn answered 8/5, 2023 at 13:30 Comment(0)