CakePHP Xml utility library triggers DOMDocument warning

I'm generating XML in a view with CakePHP's Xml core library:

$xml = Xml::build($data, array('return' => 'domdocument'));
echo $xml->saveXML();

View is fed from the controller with an array:

$this->set(
    array(
        'data' => array(
            'root' => array(
                array(
                    '@id' => 'A & B: OK',
                    'name' => 'C & D: OK',
                    'sub1' => array(
                        '@id' => 'E & F: OK',
                        'name' => 'G & H: OK',
                        'sub2' => array(
                            array(
                                '@id' => 'I & J: OK',
                                'name' => 'K & L: OK',
                                'sub3' => array(
                                    '@id' => 'M & N: OK',
                                    'name' => 'O & P: OK',
                                    'sub4' => array(
                                        '@id' => 'Q & R: OK',
                                        '@'   => 'S & T: ERROR',
                                    ),
                                ),
                            ),
                        ),
                    ),
                ),
            ),
        ),
    )
);

For whatever reason, CakePHP issues an internal call like this:

$dom = new DOMDocument;
$key = 'sub4';
$childValue = 'S & T: ERROR';
$dom->createElement($key, $childValue);

... which triggers a PHP warning:

Warning (2): DOMDocument::createElement(): unterminated entity reference               T [CORE\Cake\Utility\Xml.php, line 292]

... because (as documented) DOMDocument::createElement does not escape values. However, the warning is only triggered for certain nodes, as the test case illustrates.
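
A minimal standalone sketch of that asymmetry, outside CakePHP, might look like the following. It assumes only PHP's bundled DOM extension; the error handler and the `$warnings` variable are illustrative additions, not part of the original test case. The attribute goes through `setAttribute()`, which escapes the ampersand, while the element value goes through the second argument of `createElement()`, which does not.

```php
<?php
// Collect warnings instead of printing them (illustrative helper, not CakePHP code).
$warnings = array();
set_error_handler(function ($errno, $errstr) use (&$warnings) {
    $warnings[] = $errstr;
    return true; // mark the warning as handled
});

$dom = new DOMDocument;

// The second argument is treated as markup-like content: a bare "&" is invalid.
$element = $dom->createElement('sub4', 'S & T: ERROR');

restore_error_handler();

// Attribute values are escaped on serialization.
$element->setAttribute('id', 'Q & R: OK');
$dom->appendChild($element);

$xml = $dom->saveXML($element);

echo $xml, "\n";
print_r($warnings);
```

The attribute is serialized as `Q &amp; R: OK`, while the element value is the one that triggers the warning.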

Am I doing something wrong, or did I just hit a bug in CakePHP?

Honig answered 9/4, 2014 at 8:1 Comment(2)
Saliva: Wrapping the value like this will do the trick: $dom->createElement($key, htmlspecialchars($childValue));
Negative: @Saliva - Please read the question again. This is a CakePHP question and I'm not calling DOM functions directly, just building an array. And I cannot patch the CakePHP core that way because some elements are already escaped and others are not. (See the accepted answer for some additional details.)
H
-1

The problem seems to be in nodes that have both attributes and a value, and thus need to use the @ syntax:

'@id' => 'A & B: OK',  // <-- Handled as plain text
'name' => 'C & D: OK', // <-- Handled as plain text
'@' => 'S & T: ERROR', // <-- Handled as raw XML

I've written a little helper function:

protected function escapeXmlValue($value){
    return is_null($value) ? null : htmlspecialchars($value, ENT_XML1, 'UTF-8');
}

... and take care of calling it manually when I create the array:

'@id' => 'A & B: OK',
'name' => 'C & D: OK',
'@' => $this->escapeXmlValue('S & T: NOW WORKS FINE'),

It's hard to say whether it's a bug or a feature, since the documentation doesn't mention it.
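
As a rough standalone check of what the helper does, the sketch below rewrites it as a plain function (an assumption made so it runs outside a CakePHP class) and feeds the result to `createElement()` directly, mimicking the internal call quoted in the question. Because `ENT_XML1` is passed on its own, quotes are left untouched, which is fine for element content.

```php
<?php
// Standalone version of the helper above (plain function instead of a class method).
function escapeXmlValue($value)
{
    return is_null($value) ? null : htmlspecialchars($value, ENT_XML1, 'UTF-8');
}

$dom = new DOMDocument;

// Same shape as the internal call quoted in the question, but with a pre-escaped value.
$element = $dom->createElement('sub4', escapeXmlValue('S & T: NOW WORKS FINE'));
$element->setAttribute('id', 'Q & R: OK');
$dom->appendChild($element);

$xml = $dom->saveXML($element);
echo $xml, "\n";
```

The escaped value is parsed back into a literal ampersand, so the node's text content is the original string and it is serialized as `&amp;` exactly once.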

Honig answered 9/4, 2014 at 8:50 Comment(0)
F
16

This is a bug in PHP's DOMDocument::createElement() method. Here are two ways to avoid the problem.

Create Text Nodes

Create the text node separately and append it to the element node.

$dom = new DOMDocument;
$dom
  ->appendChild($dom->createElement('element'))
  ->appendChild($dom->createTextNode('S & T: ERROR'));

var_dump($dom->saveXml());

Output:

string(58) "<?xml version="1.0"?>
<element>S &amp; T: ERROR</element>
"

This is the originally intended way to add text nodes to a DOM. You always create a node (element, text, CDATA, ...) and append it to its parent node. You can add more than one node, and different kinds of nodes, to one parent, as in the following example:

$dom = new DOMDocument;
$p = $dom->appendChild($dom->createElement('p'));
$p->appendChild($dom->createTextNode('Hello '));
$b = $p->appendChild($dom->createElement('b'));
$b->appendChild($dom->createTextNode('World!'));

echo $dom->saveXml();

Output:

<?xml version="1.0"?>
<p>Hello <b>World!</b></p>

Property DOMNode::$textContent

DOM Level 3 introduced a new node property called textContent. It abstracts the content/value of a node depending on the node type. Setting the $textContent of an element node will replace all its child nodes with a single text node. Reading it returns the content of all descendant text nodes.

$dom = new DOMDocument;
$dom
  ->appendChild($dom->createElement('element'))
  ->textContent = 'S & T: ERROR';

var_dump($dom->saveXml());
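
Both directions of `$textContent` can be seen in a short sketch. It assumes a PHP version in which the property is writable (as the example above already does); the variable names are illustrative.

```php
<?php
$dom = new DOMDocument;
$dom->loadXML('<p>Hello <b>World!</b></p>');
$p = $dom->documentElement;

// Reading concatenates the text of all descendant text nodes.
$before = $p->textContent;
echo $before, "\n"; // Hello World!

// Writing replaces all child nodes with a single text node; "&" is escaped on output.
$p->textContent = 'S & T: ERROR';
$after = $dom->saveXML($p);
echo $after, "\n"; // <p>S &amp; T: ERROR</p>
```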
Foxy answered 9/4, 2014 at 9:8 Comment(2)
Negative: I haven't tested whether you can insert DOMDocument objects in the data array but, if you know beforehand what values need fixing (¹), this is a quite convoluted workaround :) (¹) I didn't know when I asked the question.
Foxy: Actually this is not a workaround. The second argument in createElement() breaks the W3C DOM spec. The example above is the standard way to add text nodes. The argument in the method is just a shortcut - a broken one.
R
4

This is because DOMDocument expects the value to be valid markup: a bare & starts an entity reference, so a value such as 'S & T' breaks the content and generates an "unterminated entity reference" warning.

Just pass the value through htmlentities() before using it to create elements:

$dom = new DOMDocument;
$key = 'sub4';
$childValue = htmlentities('S & T: ERROR');
$dom->createElement($key ,$childValue);
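
One caveat, sketched below under the assumption that values may contain non-ASCII characters: htmlentities() emits HTML named entities such as `&eacute;`, which XML does not predefine, whereas htmlspecialchars() with ENT_XML1 only touches the characters that are special in XML. The sample value and variable names are illustrative, and loadXML() is used here only as a convenient well-formedness check.

```php
<?php
$value = "caf\xC3\xA9 & bar"; // "café & bar" encoded as UTF-8

$viaEntities = htmlentities($value, ENT_QUOTES, 'UTF-8');
$viaSpecial  = htmlspecialchars($value, ENT_XML1, 'UTF-8');

echo $viaEntities, "\n"; // caf&eacute; &amp; bar
echo $viaSpecial, "\n";  // café &amp; bar

// Check whether each result is well-formed XML content.
libxml_use_internal_errors(true);
$dom = new DOMDocument;
$okEntities = $dom->loadXML('<sub4>' . $viaEntities . '</sub4>');
$okSpecial  = $dom->loadXML('<sub4>' . $viaSpecial . '</sub4>');
libxml_clear_errors();

var_dump($okEntities); // bool(false): &eacute; is not defined in XML
var_dump($okSpecial);  // bool(true)
```

For plain ASCII input such as 'S & T: ERROR' both functions give the same result.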
Randi answered 13/11, 2014 at 17:53 Comment(0)
P
0

It is because of this character: &. You need to replace it with the relevant entity, &amp;. To perform the translation, you can use the htmlspecialchars function. You have to escape the value when writing to the nodeValue property. As quoted from a 2005 bug report:

ampersands ARE properly encoded when setting the property textContent. Unfortunately they are not encoded when the text string is passed as the optional second arguement to DOMElement::createElement You must create a text node, set the textContent, then append the text node to the new element.

htmlspecialchars($string, ENT_QUOTES, 'UTF-8');

This is the translation table:

'&' (ampersand) becomes '&amp;'
'"' (double quote) becomes '&quot;' when ENT_NOQUOTES is not set.
"'" (single quote) becomes '&#039;' (or &apos;) only when ENT_QUOTES is set.
'<' (less than) becomes '&lt;'
'>' (greater than) becomes '&gt;'

This script will do the translations recursively:

<?php
function clean($type) {
  if(is_array($type)) {
    foreach($type as $key => $value){   
     $type[$key] = clean($value);
    }
    return $type;
  } else {
    $string = htmlspecialchars($type, ENT_QUOTES, 'UTF-8');
    return $string;
  }
}

$data = array(
    'data' => array(
        'root' => array(
            array(
                '@id' => 'A & B: OK',
                'name' => 'C & D: OK',
                'sub1' => array(
                    '@id' => 'E & F: OK',
                    'name' => 'G & H: OK',
                    'sub2' => array(
                        array(
                            '@id' => 'I & J: OK',
                            'name' => 'K & L: OK',
                            'sub3' => array(
                                '@id' => 'M & N: OK',
                                'name' => 'O & P: OK',
                                'sub4' => array(
                                    '@id' => 'Q & R: OK',
                                    '@' => 'S & T: ERROR',
                                ) ,
                            ) ,
                        ) ,
                    ) ,
                ) ,
            ) ,
        ) ,
    ) ,
);

$data = clean($data);
print_r($data);

Output

Array
(
    [data] => Array
        (
            [root] => Array
                (
                    [0] => Array
                        (
                            [@id] => A &amp; B: OK
                            [name] => C &amp; D: OK
                            [sub1] => Array
                                (
                                    [@id] => E &amp; F: OK
                                    [name] => G &amp; H: OK
                                    [sub2] => Array
                                        (
                                            [0] => Array
                                                (
                                                    [@id] => I &amp; J: OK
                                                    [name] => K &amp; L: OK
                                                    [sub3] => Array
                                                        (
                                                            [@id] => M &amp; N: OK
                                                            [name] => O &amp; P: OK
                                                            [sub4] => Array
                                                                (
                                                                    [@id] => Q &amp; R: OK
                                                                    [@] => S &amp; T: ERROR
                                                                )

                                                        )

                                                )

                                        )

                                )

                        )

                )

        )

)
Piddle answered 9/4, 2014 at 8:3 Comment(10)
Carhop: The OP clearly states that it's a warning - but wants to understand the reason for it. Simply ignoring a warning is a very bad idea.
Favouritism: Suppressing the message doesn't fix the underlying problem. It's like sticking your fingers in your ears and singing at the top of your voice.
Piddle: It does fix it in some contexts.
Negative: Are you sure? Docs don't mention the need to pre-process data manually and my example does the right thing in every other value with ampersands...
Piddle: DOMDocument is a native PHP class, so I believe CakePHP and PHP the programming language are mutually exclusive here.
Negative: How am I supposed to know which nodes will be escaped automatically by CakePHP and which ones need htmlspecialchars? (Mutually exclusive? What do you mean?)
Piddle: Any time you assign a value to a DOM node, you need to escape the special characters, as some characters have special meanings that might indicate that the node has reached its end and should be terminated by a < or a >. Since DOMDocument is a native PHP class, this might be an implementation problem on CakePHP's side. How much of an abstraction are they providing over the DOMDocument API?
Negative: Your script will fix the S & T value but will double-encode the rest (i.e., it will produce literal A &amp; B values, serialized as A &amp;amp; B, in the resulting XML). I think I found the underlying rule for the misbehaviour, please see my answer.
Piddle: Yes, that is correct behavior. You can always decode the strings back to their original HTML characters.
Negative: How can it be correct to use different encodings depending on something as arbitrary as the node name?
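
One way to sidestep the double-encoding concern raised in these comments is to escape only the values stored under the '@' key. This is a sketch under the assumption, taken from the asker's own observation, that those are the only values CakePHP passes through as raw XML; it has not been checked against other CakePHP versions. The function name `escapeAtValues` is hypothetical.

```php
<?php
// Hypothetical helper: walk the array and escape only values stored under '@'.
function escapeAtValues(array $data)
{
    foreach ($data as $key => $value) {
        if (is_array($value)) {
            $data[$key] = escapeAtValues($value);
        } elseif ($key === '@' && is_string($value)) {
            // Strict comparison so that numeric keys such as 0 never match '@'.
            $data[$key] = htmlspecialchars($value, ENT_XML1, 'UTF-8');
        }
    }
    return $data;
}

$data = array(
    'root' => array(
        '@id'  => 'A & B: OK',
        'name' => 'C & D: OK',
        'sub4' => array(
            '@id' => 'Q & R: OK',
            '@'   => 'S & T: ERROR',
        ),
    ),
);

$clean = escapeAtValues($data);
print_r($clean);
```

Attributes and plain child values are left as they are, so they are not encoded a second time later on.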