Resolve namespaces with SimpleXML regardless of structure or namespace

I have a Google Shopping feed like this (extract):

<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">  
...
<g:id><![CDATA[Blah]]></g:id>
<title><![CDATA[Blah]]></title>
<description><![CDATA[Blah]]></description>
<g:product_type><![CDATA[Blah]]></g:product_type>

Now, SimpleXML can read the "title" and "description" tags, but it can't read the tags with the "g:" prefix.

There are solutions on Stack Overflow for this specific case, using the children() method. But I don't only want to read Google Shopping XML files; I need the code to be independent of structure and namespace. I don't know anything about the file in advance (I recursively loop through the nodes as a multidimensional array).

Is there a way to do it with SimpleXML? I could replace the colons, but I want to be able to store the array and reassemble the XML (in this case specifically for Google Shopping), so I do not want to lose information.

Leatherback answered 16/10, 2014 at 9:35 Comment(6)
If you don't know anything about the structure, what use is iterating over it? At some point, you have to actually extract meaning from it. To put it a different way: what is your required result? - Corsiglia
Regarding the http://base.google.com/ns/1.0 URL, a namespace identifier doesn't actually have to resolve to anything useful, it just has to be unique. See https://mcmap.net/q/121530/-values-for-namespace-in-xmlns-attribute - Corsiglia
I look for a list of products and recognize it by looking for reappearing keys. - Leatherback
So you are looking for nodes with a particular string in them? - Corsiglia
No, I look for similar keys in sub arrays, completely dynamic. - Leatherback
I don't understand how you expect the code to work without knowing anything of the structure of the XML. If you just look for two nodes which are similar to each other, how will you know that those are the strings you want, and not some completely irrelevant detail? Perhaps you could give an example of the kind of thing you want to achieve. It's certainly possible to iterate over every node of an XML document (both using SimpleXML, and using other types of parser), but it's hard to suggest an approach without knowing what you need to do with the data. - Corsiglia

You want to use SimpleXMLElement to extract data from XML and convert it into an array.

This is generally possible but comes with some caveats. Before getting to XML namespaces, note that your XML contains CDATA sections. For XML to array conversion with SimpleXML you need to convert CDATA to text when you load the XML string. This is done with the LIBXML_NOCDATA flag. Example:

$xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA);
print_r($xml); // print_r shows how SimpleXMLElement does array conversion

This gives you the following output:

SimpleXMLElement Object
(
    [@attributes] => Array
        (
            [version] => 2.0
        )

    [title] => Blah
    [description] => Blah
)

As you can already see, there is no nice form to present the attributes in an array, therefore Simplexml by convention puts these into the @attributes key.

The other problem you have is handling multiple XML namespaces. In the previous example no specific namespace was requested, so the elements without a prefix were used (this answer calls that the default namespace). When you convert a SimpleXMLElement to an array, the namespace of that SimpleXMLElement is used. As none was explicitly specified, the default has been taken.

But if you specify a namespace when you create the array, that namespace is taken.

Example:

$xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA, "http://base.google.com/ns/1.0");
print_r($xml);

This gives you the following output:

SimpleXMLElement Object
(
    [id] => Blah
    [product_type] => Blah
)

As you can see, this time the namespace that has been specified when the SimpleXMLElement was created is used in the array conversion: http://base.google.com/ns/1.0.
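
A comment further down reports that the load-time namespace argument returned an empty object for prefixed elements nested deeper in the document. One way to be explicit, sketched here under assumptions, is to call children() on the element in question, either with the namespace URI or with the prefix. The channel/item nesting below is hypothetical and not part of the original extract.

```php
<?php
// Sketch: the URI identifies the namespace; the prefix is only a document-local alias.
$buffer = <<<XML
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<item>
<g:id><![CDATA[Blah]]></g:id>
<title><![CDATA[Blah]]></title>
</item>
</channel>
</rss>
XML;

$xml  = simplexml_load_string($buffer, SimpleXMLElement::class, LIBXML_NOCDATA);
$item = $xml->channel->item; // hypothetical nesting, assumed for illustration

// Select the namespaced children of this element by URI ...
$byUri = $item->children('http://base.google.com/ns/1.0');
// ... or by prefix, with the second argument set to true.
$byPrefix = $item->children('g', true);

$idByUri    = (string) $byUri->id;
$idByPrefix = (string) $byPrefix->id;
$title      = (string) $item->title; // elements without a namespace need no children() call

echo $idByUri, PHP_EOL, $idByPrefix, PHP_EOL, $title, PHP_EOL;
```

Both calls address the same elements; the URI form keeps working if a feed happens to bind the same namespace to a different prefix.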

Since you write that you want to take all namespaces from the document into account, you need to obtain them first, including the default one:

$xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA);
$namespaces = [null] + $xml->getDocNamespaces(true);

Then you can iterate over all namespaces and recursively merge the results into a single array:

$array = [];
foreach ($namespaces as $namespace) {
    // cast to string: passing null for this string parameter is deprecated since PHP 8.1
    $xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA, (string) $namespace);
    $array = array_merge_recursive($array, (array) $xml);
}
print_r($array);

This should finally produce the array you are after:

Array
(
    [@attributes] => Array
        (
            [version] => 2.0
        )

    [title] => Blah
    [description] => Blah
    [id] => Blah
    [product_type] => Blah
)

As you can see, this is perfectly possible with SimpleXMLElement. However it's important you understand how SimpleXMLElement converts into an array (or serializes to JSON which does follow the same rules). To simulate the SimpleXMLElement-to-array conversion, you can make use of print_r for a quick output.

Note that not all XML constructs can be equally well converted into an array. That's not specifically a limitation of Simplexml but lies in the nature of which structures XML can represent and which structures an array can represent.
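
A small sketch of one such construct, using a made-up document rather than the feed from the question: sibling order across different element names is not preserved, because same-named elements are grouped under a single key.

```php
<?php
// Made-up input: the siblings appear in the order a, b, a.
$xml = simplexml_load_string('<r><a>1</a><b>2</b><a>3</a></r>');

// JSON encoding follows the same conversion rules as the array cast.
$json  = json_encode($xml);
$array = json_decode($json, true);

echo $json, PHP_EOL;
// Both <a> elements end up under the key "a", so the position of <b>
// between them can no longer be recovered from the array.
print_r(array_keys($array));
```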

Therefore it is most often better to keep the XML inside an object like SimpleXMLElement (or DOMDocument) to access and deal with the data - and not with an array.

However, it's perfectly fine to convert data into an array as long as you know what you are doing and you don't need to write much code to access members deeper down the tree. Otherwise SimpleXMLElement is to be favored over an array, because it gives dedicated access to many XML features and also allows querying like a database with the SimpleXMLElement::xpath method. You would need to write many lines of your own code to access data inside the XML tree that comfortably with an array.
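
As a rough illustration of that kind of querying, the sketch below binds a prefix to the namespace URI with registerXPathNamespace() and then selects namespaced elements at any depth. The item structure, the sample values and the prefix name "goog" are assumptions made for the example.

```php
<?php
$buffer = <<<XML
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<channel>
<item><g:id>A1</g:id><title>Blah</title><g:product_type>Shoes</g:product_type></item>
<item><g:id>B2</g:id><title>Other</title><g:product_type>Hats</g:product_type></item>
</channel>
</rss>
XML;

$xml = simplexml_load_string($buffer);

// The prefix used in the query is our own choice; only the URI has to match.
$xml->registerXPathNamespace('goog', 'http://base.google.com/ns/1.0');

// All ids, regardless of how deep they are nested.
$ids = array_map('strval', $xml->xpath('//goog:id'));

// Product types of items whose title is "Blah".
$types = array_map('strval', $xml->xpath('//item[title="Blah"]/goog:product_type'));

print_r($ids);
print_r($types);
```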

To get the best of both worlds, you can extend SimpleXMLElement for your specific conversion needs:

$buffer = <<<BUFFER
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
...
<g:id><![CDATA[Blah]]></g:id>
<title><![CDATA[Blah]]></title>
<description><![CDATA[Blah]]></description>
<g:product_type><![CDATA[Blah]]></g:product_type>
</rss>
BUFFER;

$feed = new Feed($buffer, LIBXML_NOCDATA);
print_r($feed->toArray());

This outputs:

Array
(
    [@attributes] => stdClass Object
        (
            [version] => 2.0
        )

    [title] => Blah
    [description] => Blah
    [id] => Blah
    [product_type] => Blah
    [@text] => ...
)

For the underlying implementation:

class Feed extends SimpleXMLElement implements JsonSerializable
{
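    #[\ReturnTypeWillChange] // avoids the PHP 8.1+ return type deprecation notice; a plain comment on older versions
    public function jsonSerialize()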
    {
        $array = array();

        // json encode attributes if any.
        if ($attributes = $this->attributes()) {
            $array['@attributes'] = iterator_to_array($attributes);
        }

        $namespaces = [null] + $this->getDocNamespaces(true);
        // json encode child elements if any. group on duplicate names as an array.
        foreach ($namespaces as $namespace) {
            foreach ($this->children($namespace) as $name => $element) {
                if (isset($array[$name])) {
                    if (!is_array($array[$name])) {
                        $array[$name] = [$array[$name]];
                    }
                    $array[$name][] = $element;
                } else {
                    $array[$name] = $element;
                }
            }
        }

        // json encode non-whitespace element simplexml text values.
        $text = trim((string) $this);
        if (strlen($text)) {
            if ($array) {
                $array['@text'] = $text;
            } else {
                $array = $text;
            }
        }

        // return empty elements as NULL (self-closing or empty tags)
        if (!$array) {
            $array = NULL;
        }

        return $array;
    }

    public function toArray() {
        return (array) json_decode(json_encode($this));
    }
}

This is an adaptation, with namespace support, of the Changing JSON Encoding Rules example given in SimpleXML and JSON Encode in PHP – Part III and End.

Guardado answered 16/10, 2014 at 23:11 Comment(11)
I think the part of this I agree most with is "it is most often better to keep the XML inside an object like SimpleXMLElement (or DOMDocument) to access and deal with the data - and not with an array.". All the hacks required to produce an array with absolutely everything in it go away if you identify the information you actually need and use the available APIs to retrieve that information. - Corsiglia
@IMSoP: Sure, that's a central sentence in the answer and I put it there because it's important to reflect on. I think the other part is to show that "completely dynamic" is not easily feasible, too. But on the other hand, it is also an example of how the namespace parameter(s) of the SimpleXMLElement constructor actually work; the array conversion is more the use case and most likely just a detail. - Guardado
Thank you for your extensive explanation! :) The problem is, if I use the $xml = simplexml_load_string($buffer, null, LIBXML_NOCDATA, "http://base.google.com/ns/1.0"); part, I get an empty object; without the namespace I still get the normal results without namespace. How does it know it belongs to the "g:" prefix? - Leatherback
In addition I have to say I use a string from a file, not the buffer declaration; I hope that doesn't cause any problems. - Leatherback
@phogl: The g: prefix is just a prefix. More important is the URI. And for file usage there is simplexml_load_file or the optional third parameter in SimpleXMLElement::__construct, depending on which way of creating the object you prefer. - Guardado
@Guardado Unfortunately the command gives me an empty object if I add the namespace parameter; without it, it works and returns the non-prefixed nodes. Could the reason be that there is nothing behind http://base.google.com/ns/1.0? (I will read your answer on Monday) - Leatherback
@Guardado I found out that your command works only if the prefixed nodes are on the top level and not in child nodes. - Leatherback
What do you mean by "only"? That is how you presented the XML in your question. Feed::toArray is for the general case, if you need all children at any depth. - Guardado
@phogl: And as others have explained, "http://base.google.com/ns/1.0" is a URI. A URI does not have to point to anything on the internet, so you should not assume that it resolves. Compare with RFC 3986. So no, the problem has nothing to do with there being nothing behind that string (wherever the "behind" of a string is) :). Good to read you managed to achieve what you were looking for. - Guardado
I was hoping that this solution would preserve the order of the tags, but it does not appear so. Is there no way to achieve that, short of writing an XML parser from scratch? - Latricelatricia
@TedPhillips: Good question. Instead of the children() method, xpath over all child nodes might be handy here. I have not tested it, but it should be xpath("*"), giving all child nodes (elements). - Guardado
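
The xpath("*") idea from the last comment can be sketched as follows. This is an illustration and not part of the original answer; it assumes the DOM extension is available so that dom_import_simplexml() can report the prefixed element name.

```php
<?php
$buffer = <<<XML
<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0" xmlns:g="http://base.google.com/ns/1.0">
<g:id><![CDATA[Blah]]></g:id>
<title><![CDATA[Blah]]></title>
<description><![CDATA[Blah]]></description>
<g:product_type><![CDATA[Blah]]></g:product_type>
</rss>
XML;

$xml = simplexml_load_string($buffer, SimpleXMLElement::class, LIBXML_NOCDATA);

// "*" selects all child elements in document order, whatever their namespace.
$ordered = [];
foreach ($xml->xpath('*') as $child) {
    // The DOM view of the node keeps the prefix in nodeName, e.g. "g:id".
    $ordered[] = [dom_import_simplexml($child)->nodeName, (string) $child];
}

// List the element names in the order they were found.
print_r(array_column($ordered, 0));
```

Storing name and value pairs in a list, instead of using the names as array keys, is what keeps the order intact here.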

The answer given by hakre was well written and exactly what I was looking for, especially the Feed class he provided at the end. But it was incomplete in a couple of ways, so I modified his class to be more generically useful and wanted to share the changes:

  • One of the most important issues missed in the original is that attributes may also have namespaces; without taking that into account, you are quite likely to miss attributes on elements.

  • The other bit which is important is that when converting to an array, if you have something which may contain elements of the same name but different namespaces, there is no way to tell which namespace the element was from. (Yes, it's a really rare situation... but I ran into it with a government standard based on NIEM...) So I added a static option which will cause the namespace prefix to be added to all keys in the final array that belong to a namespace. To use it, set Feed::$withPrefix = true; before calling toArray()

  • Finally, more for my own preferences, I added an option to toArray() to return the final array as associative instead of using objects.
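
The first point can be seen in isolation with a small sketch. The document and the namespace URI http://example.com/x are made up for illustration.

```php
<?php
// Made-up document: one attribute without a namespace, one in the "x" namespace.
$xml = simplexml_load_string(
    '<doc xmlns:x="http://example.com/x"><item id="1" x:id="2"/></doc>'
);
$item = $xml->item;

// Without an argument, only attributes that are in no namespace are returned.
$plain = $item->attributes();
// Namespaced attributes have to be requested by namespace URI (or by prefix).
$namespaced = $item->attributes('http://example.com/x');

$plainNames = array_keys(iterator_to_array($plain));
$plainId    = (string) $plain['id'];
$nsId       = (string) $namespaced['id'];

echo $plainId, PHP_EOL, $nsId, PHP_EOL;
```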

Here's the updated class:

class Feed extends \SimpleXMLElement implements \JsonSerializable
{
    public static $withPrefix = false;

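    #[\ReturnTypeWillChange] // avoids the PHP 8.1+ return type deprecation notice; a plain comment on older versions
    public function jsonSerialize()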
    {
        $array = array();
        $attributes = array();

        $namespaces = [null] + $this->getDocNamespaces(true);

        // collect attributes and child elements per namespace. group duplicate element names as an array.
        foreach ($namespaces as $prefix => $namespace) {
            foreach ($this->attributes($namespace) as $name => $attribute) {
                if (static::$withPrefix && !empty($namespace)) {
                    $name = $prefix . ":" . $name;
                }
                $attributes[$name] = $attribute;
            }

            foreach ($this->children($namespace) as $name => $element) {
                if (static::$withPrefix && !empty($namespace)) {
                    $name = $prefix . ":" . $name;
                }
                if (isset($array[$name])) {
                    if (!is_array($array[$name])) {
                        $array[$name] = [$array[$name]];
                    }
                    $array[$name][] = $element;
                } else {
                    $array[$name] = $element;
                }
            }
        }

        if (!empty($attributes)) {
            $array['@attributes'] = $attributes;
        }

        // json encode non-whitespace element simplexml text values.
        $text = trim((string) $this);
        if (strlen($text)) {
            if ($array) {
                $array['@text'] = $text;
            } else {
                $array = $text;
            }
        }

        // return empty elements as NULL (self-closing or empty tags)
        if (!$array) {
            $array = NULL;
        }

        return $array;
    }

    public function toArray($assoc=false) {
        return (array) json_decode(json_encode($this), $assoc);
    }
}
Drava answered 20/11, 2020 at 21:19 Comment(0)
