Perl XML::LibXML: how to access comment nodes

For the life of me I can't figure out the proper code to access the comment lines in my XML file. Do I use findnodes, find, or getElementsByTagName (I doubt it)?

Am I even making the correct assumption that these comment lines are accessible? I would hope so, as I know I can add a comment.

The type number for a comment node is 8, so they must be parseable.

Ultimately, what I want to do is delete them.

```
my @nodes = $dom->findnodes("//*");

foreach my $node (@nodes) {
  print $node->nodeType, "\n";
}
```

```
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>
```

Nard answered 17/10, 2013 at 16:28 Comment(2)

You should be able to use XML::LibXML::Reader to step through each node and skip the ones that are of type 8 (comment). Would that work for you? – Tolidine 17/10, 2013 at 16:37

ug, don't use 8, use XML_READER_TYPE_COMMENT (from use XML::LibXML::Reader qw( XML_READER_TYPE_COMMENT );) – Mikkimiko 17/10, 2013 at 17:6
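
A minimal sketch of the reader approach suggested in the two comments above, assuming the sample document from the question is supplied as a string. The variable names are illustrative only; the comment node is detected with the named constant rather than the literal 8.

```perl
use strict;
use warnings;

use XML::LibXML::Reader qw( XML_READER_TYPE_COMMENT );

# Sample data copied from the question
my $xml = <<'END_XML';
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>
END_XML

my $reader = XML::LibXML::Reader->new(string => $xml);

my @comments;

# read() returns 1 for as long as there is another node to visit
while ($reader->read == 1) {
    next unless $reader->nodeType == XML_READER_TYPE_COMMENT;
    push @comments, $reader->value;    # text between <!-- and -->
}

print "Found comment: [$_]\n" for @comments;
```

The reader is a forward-only pull parser, so this is suited to inspecting or skipping comments while streaming. To delete them from a document tree, the DOM approaches in the answers below are a better fit.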

According to the XPath spec:

The node test * matches element nodes of any name. Comment nodes aren't element nodes, so //* never returns them.
The node test comment() matches comment nodes.

Untested:

```
for my $comment_node ($doc->findnodes('//comment()')) {
   $comment_node->parentNode->removeChild($comment_node);
}
```
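
A small self-contained check of the two node tests, assuming the sample document from the question. It relies on the node type constants (XML_ELEMENT_NODE, XML_COMMENT_NODE) that XML::LibXML exports by default, and then applies the removal loop shown above.

```perl
use strict;
use warnings;

use XML::LibXML;    # exports XML_ELEMENT_NODE, XML_COMMENT_NODE, etc. by default

my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>
END_XML

my @elements = $doc->findnodes('//*');
my @comments = $doc->findnodes('//comment()');

# Confirm that each node test returns only the kind of node it names
my $only_elements = !grep { $_->nodeType != XML_ELEMENT_NODE } @elements;
my $only_comments = !grep { $_->nodeType != XML_COMMENT_NODE } @comments;

printf "//* matched %d node(s), all elements: %s\n",
    scalar @elements, $only_elements ? 'yes' : 'no';
printf "//comment() matched %d node(s), all comments: %s\n",
    scalar @comments, $only_comments ? 'yes' : 'no';

# Detach every comment from its parent
for my $comment_node (@comments) {
    $comment_node->parentNode->removeChild($comment_node);
}

my @remaining = $doc->findnodes('//comment()');
printf "comments left after removal: %d\n", scalar @remaining;
```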

Mikkimiko answered 17/10, 2013 at 17:1 Comment(5)

thanks @Mikkimiko ... So the comment line is associated with the Parent node above. Can I get the children of the comment? – Nard 17/10, 2013 at 17:28

@CraigP: Only element nodes have children. What is confusing you is the awkward method that has been used to remove the comment node, which finds the parent of the comment and calls removeChild to remove the comment. Take a look at my answer for a clearer and more concise way. – Blate 17/10, 2013 at 21:12

@Borodin, Document and DocumentFragment nodes can also have children. Having to learn an obscure function that moves nodes to some obscure DocumentFragment object is not clearer in my opinion. One should never have to deal with either of those things! – Mikkimiko 17/10, 2013 at 21:35

@ikegami: Both of those node types are a little esoteric in that they always have exactly one child, being the document root element. I think it's more confusing than enlightening to consider those. Likewise the temporary parking place for nodes removed from the main document with unbindNode, which is clearly intended to be transparent. – Blate 17/10, 2013 at 22:6

@Borodin, Which one is it? Esoteric or transparent? I believe that DocumentFragment is the former, yet you claim a function that requires understanding it to be clearer than using one called removeChild to remove a node. – Mikkimiko 17/10, 2013 at 23:30

If all you need to do is produce a copy of the XML with comment nodes removed, then the first parameter of toStringC14N is a flag that says whether you want comments in the output. Omitting all parameters implicitly sets the first to a false value, so
```
$doc->toStringC14N
```

will reproduce the XML trimmed of comments. Note that the Canonical XML form specified by C14N doesn't include an XML declaration header. It is always XML 1.0 encoded in UTF-8.
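
As a rough illustration of that flag, the sketch below calls toStringC14N with an explicit false and an explicit true first argument on the question's sample document. The variable names are illustrative only.

```perl
use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>
END_XML

# First parameter false: comments are left out of the canonical form
my $without_comments = $doc->toStringC14N(0);

# First parameter true: comments are kept
my $with_comments = $doc->toStringC14N(1);

print "flag 0 keeps comments: ", ($without_comments =~ /<!--/ ? "yes" : "no"), "\n";
print "flag 1 keeps comments: ", ($with_comments    =~ /<!--/ ? "yes" : "no"), "\n";
```

Neither call modifies the in-memory document; only the serialized string differs.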

If you need to remove the comments from the in-memory structure of the document before processing it further, then findnodes with the XPath expression //comment() will locate them for you, and unbindNode will remove them from the XML.

This program demonstrates both techniques.

```
use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML->load_xml(string => <<END_XML);
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>
END_XML

# Print everything
print $doc->toString, "\n";

# Print without comments
print $doc->toStringC14N, "\n\n";

# Remove comments and print everything
$_->unbindNode for $doc->findnodes('//comment()');
print $doc->toString;
```

Output

```
<?xml version="1.0"?>
<TT>
 <A>xyz</A>
 <!-- my comment -->
</TT>

<TT>
 <A>xyz</A>

</TT>

<?xml version="1.0"?>
<TT>
 <A>xyz</A>

</TT>
```

Update

To select a specific comment, you can add a predicate expression to the XPath selector. To find the specific comment in your example data you could write

```
$doc->findnodes('//comment()[. = " my comment "]')
```

Note that the text of the comment is everything between the opening <!-- and the closing -->, so the leading and trailing spaces are significant, as shown in that call.

If you want to make things a bit more lax, you could use normalize-space, which removes leading and trailing whitespace, and contracts every sequence of whitespace within the string to a single space. Now you can write

```
$doc->findnodes('//comment()[normalize-space(.) = "my comment"]')
```

And the same call would find your comment even if it looked like this.

```
<!--
my
comment
-->
```

Finally, you can make use of contains, which, as you would expect, simply checks whether one string contains another. Using that you could write

```
$doc->findnodes('//comment()[contains(., "comm")]')
```

The one to choose depends on your requirement and your situation.
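
To compare the three predicates side by side, here is a hedged sketch that runs them against a slightly extended version of the sample data. The two extra comments are made up for illustration and are not part of the original question.

```perl
use strict;
use warnings;

use XML::LibXML;

# The second and third comments are hypothetical additions for this demo
my $doc = XML::LibXML->load_xml(string => <<'END_XML');
<TT>
 <A>xyz</A>
 <!-- my comment -->
 <!--
my
comment
-->
 <!-- something else -->
</TT>
END_XML

# Exact match: whitespace inside the comment must match character for character
my @exact = $doc->findnodes('//comment()[. = " my comment "]');

# Whitespace-insensitive match
my @normalized = $doc->findnodes('//comment()[normalize-space(.) = "my comment"]');

# Substring match
my @containing = $doc->findnodes('//comment()[contains(., "comm")]');

printf "exact: %d, normalized: %d, contains: %d\n",
    scalar @exact, scalar @normalized, scalar @containing;

# Remove only the comments selected by the normalize-space predicate
$_->unbindNode for @normalized;

my @remaining = $doc->findnodes('//comment()');
printf "comments left: %d\n", scalar @remaining;
```

Only the comments matched by the chosen predicate are unbound, so unrelated comments stay in the document.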

Blate answered 17/10, 2013 at 17:10 Comment(2)

How do I xPATH a particular comment? e.g. - foreach my $CN ($dom->findnodes('//comment()="COM1"')) I've tried [comment()="COM1"] , even with "<1-- COM1 -->" .... nothing seems to work. – Nard 17/10, 2013 at 18:30

@CraigP: I have added to my answer to explain. You probably want //comment()[normalize-space(.) = "COM1"] – Blate 17/10, 2013 at 21:29

I know it's not XML::LibXML but here you have another way to remove comments easily with XML::Twig module:

```
#!/usr/bin/env perl

use warnings;
use strict;
use XML::Twig;

my $twig = XML::Twig->new(
    pretty_print => 'indented',
    comments => 'drop'
)->parsefile( shift )->print;
```

Run it like:

```
perl script.pl xmlfile
```

That yields:

```
<TT>
  <A>xyz</A>
</TT>
```

The comments option also accepts the value process, which loads the comments into the twig so that you can work with them like other elements, using #COMMENT as their name in XPath-style expressions.

Answer answered 17/10, 2013 at 16:35 Comment(0)
