Preventing DOMDocument::loadHTML() from converting entities
I have a string value from which I'm trying to extract list items. I'd like to extract the text and any subnodes; however, DOMDocument converts the entities to characters instead of leaving them in their original state.

I've tried setting DOMDocument::resolveExternals and DOMDocument::substituteEntities to false, but this has no effect. Note that I'm running on Win7 with PHP 5.2.17.

Example code is:

$example = '<ul><li>text</li>'.
    '<li>&frac12; of this is <strong>strong</strong></li></ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));
    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;
}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        $innerHTML .= $child->ownerDocument->saveXML( $child );
    }

    return $innerHTML;
}

&frac12; ends up converted to ½ (the single UTF-8 character, not the entity), which is not the desired format.

Dray answered 8/9, 2011 at 4:56 Comment(8)
How are you determining the conversion took place? Are you displaying the results in HTML? - Groin
With an echo (the real code is a bit more complicated). I'll update the example code with the echos that I'm using at the moment. The echoed results are written to a log file and viewed in Textpad (like Notepad), not in HTML. - Dray
How are you loading the $example string into the DOMDocument? - Groin
As of 5.3.6, see php.net/manual/en/domdocument.savehtml.php (this supports $doc->saveHTML( new DOMNode('&frac12;') );) - Frustrate
@Phil. There's something to be said for making sure example code actually works before putting it up. But it actually works. - Dray
@ajreal, I was hoping to avoid upgrading PHP just for that feature. I guess the workaround for PHP 5.2.X is to use saveHTMLFile, then load the file and strip the DOCTYPE. Nasty. - Dray
@Frustrate I tried saveHTML(DOMNode $node) in 5.3.8 and it still translates the entity. - Groin
Sorry, I don't have 5.3.6+ to test. How about $doc->saveHTML( new DOMText('&frac12;') )? - Frustrate

Based on the answer provided by ajreal, I've expanded the example variable to handle more cases, and changed _get_inner_html() to make recursive calls and handle the entity conversion for text nodes.

It's probably not the best answer, since it makes some assumptions about the elements (such as no attributes). But my particular needs don't require attributes to be carried across (yet; I'm sure my sample data will throw that one at me later on), so this solution works for me.

$example = '<ul><li>text</li>'.
'<li>&frac12; of this is <strong>strong</strong></li>'.
'<li>Entity <strong attr="3">in &frac12; tag</strong></li>'.
'<li>Nested nodes <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li>'.
'</ul>';

echo 'To be converted:'.PHP_EOL.$example.PHP_EOL;

$doc = new DOMDocument();
$doc->resolveExternals = true;
$doc->substituteEntities = false;

$doc->loadHTML($example);

$domNodeList = $doc->getElementsByTagName('li');
$count = $domNodeList->length;

for ($idx = 0; $idx < $count; $idx++) {
    $value = trim(_get_inner_html($domNodeList->item($idx)));

    /* remainder of processing and storing in database */
    echo 'Saved '.$value.PHP_EOL;

}

function _get_inner_html( $node ) {
    $innerHTML= '';
    $children = $node->childNodes;
    foreach ($children as $child) {
        echo 'Node type is '.$child->nodeType.PHP_EOL;
        switch ($child->nodeType) {
        case 3:
            $innerHTML .= htmlentities(iconv('UTF-8', 'ISO-8859-1', $child->nodeValue));
            break;
        default:
            echo 'Non text node has '.$child->childNodes->length.' children'.PHP_EOL;
            echo 'Node name '.$child->nodeName.PHP_EOL;
            $innerHTML .= '<'.$child->nodeName.'>';
            $innerHTML .= _get_inner_html( $child );
            $innerHTML .= '</'.$child->nodeName.'>';
            break;
        }
    }

    return $innerHTML;
}
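
If attributes do need to be carried across, the same recursive idea can be extended. The following is only a minimal sketch, not the code this answer was tested with: the function name inner_html_with_entities() is made up for illustration, and it assumes a PHP version where htmlentities() accepts an explicit 'UTF-8' charset, which removes the need for the iconv() step. It also writes void elements such as br with a closing tag, so treat it as a starting point.

```php
<?php
// Hypothetical helper (not from the original answer): rebuilds the inner HTML
// of a node, re-encoding text as named entities and keeping attributes.
function inner_html_with_entities(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE) {
            // nodeValue is always UTF-8, so name that charset explicitly.
            $html .= htmlentities($child->nodeValue, ENT_QUOTES, 'UTF-8');
        } elseif ($child->nodeType === XML_ELEMENT_NODE) {
            $html .= '<' . $child->nodeName;
            foreach ($child->attributes as $attr) {
                // Re-escape attribute values so quotes cannot break the tag.
                $html .= ' ' . $attr->nodeName . '="'
                    . htmlentities($attr->nodeValue, ENT_QUOTES, 'UTF-8') . '"';
            }
            $html .= '>' . inner_html_with_entities($child)
                . '</' . $child->nodeName . '>';
        }
        // Comments and other node types are skipped in this sketch.
    }
    return $html;
}

$doc = new DOMDocument();
$doc->loadHTML('<ul><li>Nested <strong attr="3">in &frac12; <em>tag &frac12;</em></strong></li></ul>');
echo inner_html_with_entities($doc->getElementsByTagName('li')->item(0)), PHP_EOL;
```

For the list item above this prints: Nested <strong attr="3">in &frac12; <em>tag &frac12;</em></strong>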
Dray answered 8/9, 2011 at 6:46 Comment(1)
Use ISO-8859-1//TRANSLIT or ISO-8859-1//IGNORE to avoid notices and to stop the string being truncated at characters that don't convert successfully. For example, the presence of &trade; resulted in a notice; with the //TRANSLIT option it was converted to TM. - Dray

A solution for PHP versions before 5.3.6:

$html =<<<HTML
<ul><li>text</li>
<li>&frac12; of this is <strong>strong</strong></li></ul>
HTML;

$doc = new DOMDocument();
$doc->resolveExternals = false;
$doc->substituteEntities = false;
$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('li') as $node)
{
  echo htmlentities(iconv('UTF-8', 'ISO-8859-1', $node->nodeValue)), "\n";
}
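
On newer PHP versions, where htmlentities() treats its input as UTF-8 by default, the ISO-8859-1 round trip is likely to get in the way rather than help. A minimal sketch of the same loop, assuming such a version and passing the charset explicitly (the $lines array is only there to collect the results for illustration):

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<ul><li>text</li><li>&frac12; of this is <strong>strong</strong></li></ul>');

$lines = array();
foreach ($doc->getElementsByTagName('li') as $node) {
    // nodeValue is UTF-8; naming the charset avoids the iconv() conversion.
    $lines[] = htmlentities($node->nodeValue, ENT_QUOTES, 'UTF-8');
}
echo implode("\n", $lines), "\n";
```

As with the code above, nodeValue flattens the item to text, so the <strong> tag itself is still dropped.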
Frustrate answered 8/9, 2011 at 5:47 Comment(1)
It treats &frac12; correctly, but strips <strong>. I might try something where _get_inner_html() recognises the difference between DOMElement and DOMText, and uses an appropriate function to convert (either htmlentities or a recursive call). - Dray

I'm a bit late and maybe it's not exactly your case, but I hate hacks and I found the cleanest way to avoid the conversions you're talking about:

$d = new DOMDocument('1.0', 'UTF-8');
$d->loadXML('<?xml version="1.0" encoding="UTF-8"?><t>Hello &#xBD; World</t>');
print_r($d->saveXML());

output: <t>Hello ½ World</t>

$d = new DOMDocument('1.0', 'UTF-8');
$d->loadXML('<?xml version="1.0"?><t>Hello &#xBD; World</t>');
print_r($d->saveXML());

output: <t>Hello &#xBD; World</t>
Spirometer answered 15/10 at 7:35 Comment(0)

There is no need to iterate over the child nodes:

function innerHTML($node)
{
    // Serialize the node, then strip its own opening and closing tags.
    $html = $node->ownerDocument->saveXML($node);
    return preg_replace("%^<{$node->nodeName}[^>]*>|</{$node->nodeName}>$%", '', $html);
}
Copeck answered 6/6, 2012 at 9:50 Comment(1)
What replaces the htmlentities(iconv()) call in this example? It looks like it only strips the outer tag. - Dray
