Perl, XML::Twig: how to read fields with the same tag name

I'm processing an XML file that I receive from a partner, and I have no influence over the structure of this file. An extract of the XML is:

<?xml version="1.0" encoding="UTF-8"?>
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <description>
    <![CDATA[<p>some descrioption of the house</p>]]> </description>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
        <description>
          <![CDATA[Some text again, different than the first tag]]>
        </description>
      </localcost>
    </localcosts>
  </object>
</objects>

The reason I use Twig is that this XML file is about 11GB (about 100,000 different objects). The problem is that when I reach the localcosts part, the three fields (type, name and description) are skipped, probably because these names have already been used earlier.

The code I use to go through the xml file is as follows:

my $twig= new XML::Twig( twig_handlers => { 
                 id                            => \&get_ID,
                 name                          => \&get_Name,
                 type                          => \&get_Type,
                 description                   => \&get_Description,
                 localcosts                    => \&get_Localcosts
});

$lokaal="c:\\temp\\data3.xml";
getstore($xml, $lokaal);
$twig->parsefile("$lokaal");

sub get_ID          { my( $twig, $data)= @_;  $field[0]=$data->text; $twig->purge; } 
sub get_Name        { my( $twig, $data)= @_;  $field[1]=$data->text; $twig->purge; }
sub get_Type        { my( $twig, $data)= @_;  $field[3]=$data->text; $twig->purge; }
sub get_Description { my( $twig, $data)= @_;  $field[8]=$data->text; $twig->purge; }
sub get_Localcosts{

  my ($t, $item) = @_;

  my @localcosts = $item->children;
  for my $localcost ( @localcosts ) {
    print "$field[0]: $localcost->text\n";
    my @costs = $localcost->children;
    for my $cost (@costs) {
      $Type       =$cost->text if $cost->name eq q{type};
      $Name       =$cost->text if $cost->name eq q{name};
      $Description=$cost->text if $cost->name eq q{description};
      print "Fields: $Type, $Name, $Description\n";
    }
  }
  $t->purge;    
}

When I run this code, the main fields are read without issues, but when the code reaches the 'localcosts' part, the inner for loop is not executed. When I change the field names in the XML to unique ones, the code works perfectly.

Can someone help me out?

Thanks

Balmoral answered 8/6, 2014 at 14:2 Comment(0)

If you want the handlers for type, name and description to be triggered only inside the object tag, specify the path:

my $twig = new XML::Twig( twig_handlers => { 
                 id                    => \&get_ID,
                 'object/name'         => \&get_Name,
                 'object/type'         => \&get_Type,
                 'object/description'  => \&get_Description,
                 localcosts            => \&get_Localcosts
    });
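
The same idea can be extended to the nested fields. The following is only a minimal sketch, assuming a cut-down copy of the question's XML held in a string and two illustrative arrays (@object_fields and @cost_fields, which are not part of the original code). It shows path-qualified handlers keeping the object-level and localcost-level values apart:

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample, cut down from the extract in the question.
my $xml = <<'XML';
<objects>
  <object>
    <id>VW-XJC9</id>
    <name>Name</name>
    <type>House</type>
    <localcosts>
      <localcost>
        <type>mandatory</type>
        <name>What kind of cost</name>
      </localcost>
    </localcosts>
  </object>
</objects>
XML

# Illustrative containers, not part of the original program.
my (@object_fields, @cost_fields);

my $twig = XML::Twig->new(
    twig_handlers => {
        # Fire only for <name>/<type> that are direct children of <object>.
        'object/name' => sub { push @object_fields, 'name=' . $_->text },
        'object/type' => sub { push @object_fields, 'type=' . $_->text },
        # Fire only for <name>/<type> that are direct children of <localcost>.
        'localcost/name' => sub { push @cost_fields, 'name=' . $_->text },
        'localcost/type' => sub { push @cost_fields, 'type=' . $_->text },
        # Free memory once per complete object rather than per leaf element.
        object => sub { $_[0]->purge },
    },
);
$twig->parse($xml);

print "object: @object_fields\n";
print "cost:   @cost_fields\n";
```

Inside a handler XML::Twig sets $_ to the current element, which is why $_->text works without unpacking @_. For the real 11GB file, parsefile would replace parse.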
Bora answered 8/6, 2014 at 15:22 Comment(2)
Hi Choroba, thanks, this works! I tried this solution with the fields of the localcosts, but that did not work. But this does! Super! - Balmoral
Your other fields should look like localcost/type, localcost/name etc. At a guess you were using localcosts/type? You could use localcosts/localcost/type but there's no need. - Ethridge

The problem is that the name, type and description handlers are executed for both occurrences: once for the children of object and once for the children of localcost. You will find that the contents of @field come from the localcost values, because the data from the object values has been overwritten.

Also, while handling the children of the localcost elements, the handlers have called $twig->purge, which removes the data from memory. So when the localcosts handler is called, it finds the elements empty.

I think the easiest way to do this is to write a single handler that processes each object node in one go and then purges it.

The program below demonstrates this. Note that I have used Data::Dumper only so that you can see the contents of @fields once it has been populated.

It is very important that you use strict and use warnings at the top of every Perl program, especially if you are asking for help with it. It is a simple measure that can reveal many straightforward errors that you may otherwise waste a lot of time searching for.

Note also that the "indirect object" form of method calls is discouraged: you should write XML::Twig->new(...) instead of new XML::Twig (...).

And if you use single quotes instead of double quotes, then a backslash inside a string doesn't need to be doubled up unless it is the last character of the string. But Perl is quite happy if you use forward slashes as a path separator, even on Windows.
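
As a small illustration of the quoting point, here is a sketch using the path from the question. The variable names are only for illustration:

```perl
use strict;
use warnings;

my $double  = "c:\\temp\\data3.xml";  # double quotes: each backslash must be doubled
my $single  = 'c:\temp\data3.xml';    # single quotes: backslashes are taken literally
my $forward = 'c:/temp/data3.xml';    # forward slashes avoid the issue altogether

# Both quoted forms produce exactly the same string.
print "identical\n" if $double eq $single;
```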

I hope this helps

use strict;
use warnings;

use XML::Twig;
use Data::Dumper;
$Data::Dumper::Useqq = 1;

my $twig= XML::Twig->new( twig_handlers => { object => \&get_Object });

my $lokaal = 'c:\temp\data3.xml';

my @fields;
$twig->parsefile($lokaal);


sub get_Object {

  my ($twig, $object) = @_;

  $fields[0] = $object->findvalue('id');
  $fields[1] = $object->findvalue('name');
  $fields[3] = $object->findvalue('type');
  $fields[8] = $object->findvalue('description');

  print Dumper \@fields;

  my @localcosts = $object->findnodes('localcosts/localcost');

  for my $localcost (@localcosts) {

    my $type        = $localcost->findvalue('type');
    my $name        = $localcost->findvalue('name');
    my $description = $localcost->findvalue('description');

    print "$type, $name, $description\n";
  }

  $twig->purge;    
}

output

$VAR1 = [
          "VW-XJC9",
          "Name",
          undef,
          "House",
          undef,
          undef,
          undef,
          undef,
          "<p>some descrioption of the house</p> "
        ];
mandatory, What kind of cost, Some text again, different than the first tag
Ethridge answered 8/6, 2014 at 15:28 Comment(2)
Hi Borodin, this looks like a very nice solution. It will take me some time to rewrite the module I have, but I like this idea. It's very tidy. Thanks for your help! - Balmoral
@user2970543: I'm pleased that it helped you. You need to do a lot of work with the more complex libraries like XML::Twig before the best technique for a given situation becomes apparent. - Ethridge

As Borodin said, if you have handlers on name, type and description, and you call $twig->purge at the end of each handler, then the elements are removed from the tree. You could set a handler on object, that only does a $twig->purge call, and you would be OK.

You don't need to call purge "too often", just make sure you call it at a low enough level so you don't use too much memory. There is no point really in calling it for each single leaf element.
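
A rough sketch of that arrangement, assuming a small hypothetical two-object sample and illustrative arrays (@rows and @costs, which are not in the original code): the localcosts handler reads its children without purging, and the only purge call sits in the object handler.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical two-object sample in the same shape as the question's file.
my $xml = <<'XML';
<objects>
  <object><id>A1</id><localcosts>
    <localcost><type>mandatory</type><name>Cleaning</name></localcost>
    <localcost><type>optional</type><name>Linen</name></localcost>
  </localcosts></object>
  <object><id>B2</id><localcosts/></object>
</objects>
XML

my @rows;    # one summary line per object (illustrative)
my @costs;   # costs of the object currently being parsed (illustrative)

my $twig = XML::Twig->new(
    twig_handlers => {
        # No purge here, so the localcost children still exist when this runs.
        localcosts => sub {
            my ($t, $lc) = @_;
            @costs = map { $_->first_child_text('type') . ':' . $_->first_child_text('name') }
                     $lc->children('localcost');
        },
        # The only purge: once per finished object.
        object => sub {
            my ($t, $obj) = @_;
            push @rows, join '|', $obj->first_child_text('id'), @costs;
            @costs = ();
            $t->purge;
        },
    },
);
$twig->parse($xml);

print "$_\n" for @rows;
```

Purging once per object keeps at most one object's subtree in memory, which should be low enough for a file made of many small objects.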

That's a common mistake, one that I make myself quite often ;--(.

Impolitic answered 8/6, 2014 at 17:1 Comment(0)
