How to list XML node attributes with XML::LibXML?
Given the following XML snippet:

<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>

How do I get this output?

outline
node1=text1
node1 attribute1=value1
node1 attribute2=value2

I have looked into XML::LibXML::Reader, but that module appears to provide access to attribute values only by name. How do I get the list of attribute names in the first place?

Sharpshooter answered 7/11, 2014 at 7:34 Comment(0)
A
5

You can get the list of attributes with $e->findnodes( "./@*").

Below is a solution using plain XML::LibXML, not XML::LibXML::Reader, that works with your test data. It may be sensitive to extra whitespace and mixed content though, so test it on real data before using it.

#!/usr/bin/perl

use strict;
use warnings;

use XML::LibXML;

my $dom= XML::LibXML->load_xml( IO => \*DATA);
# all elements in the document, in document order
my $elements= $dom->findnodes( "//*");

foreach my $e (@$elements)
  { print $e->nodeName;

    # text needs to be trimmed or line returns show up in the output
    my $text= $e->textContent;
    $text=~s{^\s*}{};
    $text=~s{\s*$}{};

    if( ! $e->getChildrenByTagName( '*') && $text)
      { print "=$text"; }
    print "\n"; 

    my @attrs= $e->findnodes( "./@*");
    # or, as suggested by Borodin below, $e->attributes

    foreach my $attr (@attrs)
      { print $e->nodeName, " ", $attr->nodeName, "=", $attr->value, "\n"; }
  }
__END__
<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>
Ancient answered 7/11, 2014 at 8:10 Comment(7)
There are much cleaner ways to fetch the attributes. The obvious one is my @attrs = $e->attributes, which returns a list of all attribute nodes. An element node object also behaves as a tied hash reference: keys %$e returns all of the attribute names, while $e->{attr_name} returns the value of attribute attr_name. - Retention
Thanks, I didn't find this in the docs, which I thought was strange. And now I see it, under "Overloading", duh! I still don't see attributes though, at least in the docs for XML::LibXML::Element. - Ancient
I see, I wasn't expecting to find it there. Actually it makes no sense at all. I see that it is also used to return the list of namespace declarations associated with the node, WTF? Why one method for two extremely different results? I can't even find it in the DOM spec... Boy I'm glad I use XML::Twig ;--) - Ancient
The border between XML::LibXML::Element and XML::LibXML::Node is a little strange. I would expect all attribute stuff to appear in the former, as no other node type can have attributes. But returning the namespace declarations is kinda okay: a namespace declaration looks just like an attribute called xmlns. - Retention
Agreed. Indeed, with findnodes( "./@*") (or using %$e) you don't get the namespace declarations, while attributes gives them to you. And before testing, I thought that attributes would return a list of all namespace declarations that applied to a node, not just the ones declared in the start tag of the element. - Ancient
It has been on my list of things to do -- towards the bottom, in the section marked "interesting" -- to examine and understand the libxml2 library on which this is based: exercises like that always enhance my understanding of related software. I hope to find that strangenesses like this one in the Perl glue library are mainly due to our vision being forced through the fat lenses of the author's spectacles. - Retention
Thank you very much! I like both solutions: Borodin's for the use of attributes and mirod's for its unified approach to walking the nodes with findnodes( "//*"). (Sorry, my question was badly composed: <outline> is basically an ordinary node, just like <node1>, so what I really needed was a recursive walk over the whole document.) You've done a good job of clarifying the Perl docs too ;) - Sharpshooter
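The three ways of listing attributes discussed in these comments can be compared side by side. This is only a minimal sketch: the xmlns:x="urn:example" declaration is a hypothetical addition to the question's data, there to make the namespace difference visible, and the hash-style access assumes an XML::LibXML version recent enough to document it under "Overloading".

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical sample: the question's node1 plus one namespace declaration.
my $dom = XML::LibXML->load_xml(
    string => '<node1 xmlns:x="urn:example" attribute1="value1" attribute2="value2"/>'
);
my $e = $dom->documentElement;

# 1. attributes: attribute nodes plus the namespace declarations
#    written in this element's start tag
my @via_method = map { $_->nodeName } $e->attributes;

# 2. XPath: attribute nodes only, no namespace declarations
my @via_xpath = map { $_->nodeName } $e->findnodes('./@*');

# 3. Hash overloading: attribute names only, no namespace declarations
my @via_hash = keys %$e;

print "attributes: @via_method\n";
print "findnodes:  @via_xpath\n";
print "keys:       @via_hash\n";

# Values are available through the same hash interface.
print "attribute1 is $e->{attribute1}\n";
```

If the namespace declarations are not wanted, findnodes or the hash interface avoids having to filter them out of the attributes result.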
R
6

Something like this should help you.

It's not clear from your question whether <outline> is the root element of the data, or if it is buried somewhere in a bigger document. It's also unclear how general you want the solution to be - e.g. do you want the entire document dumped in this manner?

Anyway, this program generates the output you requested from the given XML input in a fairly concise manner.

use strict;
use warnings;
use 5.014;     # For /r non-destructive substitution mode

use XML::LibXML;

my $xml = XML::LibXML->load_xml(IO => \*DATA);

my ($node) = $xml->findnodes('//outline');

print $node->nodeName, "\n";

for my $child ($node->getChildrenByTagName('*')) {
  my $name = $child->nodeName;

  printf "%s=%s\n", $name, $child->textContent =~ s/\A\s+|\s+\z//gr;

  for my $attr ($child->attributes) {
    printf "%s %s=%s\n", $name, $attr->getName, $attr->getValue;
  }
}

__DATA__
<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>

Output:

outline
node1=text1
node1 attribute1=value1
node1 attribute2=value2
Retention answered 7/11, 2014 at 11:8 Comment(0)
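Since the asker later clarified that a recursive walk over the whole document was what they needed, the two answers can be combined along the following lines. This is a sketch rather than a tested general solution: describe is a hypothetical helper name, and it reports only the text that is a direct child of each element, so a parent does not repeat the text of its descendants.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns the report lines for $element
# followed by the lines for all of its descendant elements.
sub describe {
    my ($element) = @_;
    my $name = $element->nodeName;

    # Join only the text nodes directly under this element, then trim.
    my $text = join '', map { $_->data } $element->findnodes('./text()');
    $text =~ s/\A\s+|\s+\z//g;

    my @lines = length($text) ? "$name=$text" : $name;

    # One line per attribute; ./@* skips namespace declarations.
    push @lines, map { "$name " . $_->nodeName . "=" . $_->value }
                 $element->findnodes('./@*');

    # Recurse into child elements only.
    push @lines, map { describe($_) }
                 grep { $_->nodeType == XML_ELEMENT_NODE } $element->childNodes;

    return @lines;
}

my $dom = XML::LibXML->load_xml(string => <<'XML');
<outline>
  <node1 attribute1="value1" attribute2="value2">
    text1
  </node1>
</outline>
XML

print "$_\n" for describe($dom->documentElement);
```

Mixed content is flattened here: the text pieces around child elements are simply concatenated, which may or may not suit real data.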
