Handling large XML files that don't fit in memory is something that XML::Twig
advertises:
> One of the strengths of XML::Twig is that it let you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor being often around 10).
>
> To do this you can define handlers, that will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit (...)
The code posted in the question doesn't make use of that strength at all (using the simplify method doesn't make it much better than XML::Simple).
What's missing from the code is a 'twig_handlers' or 'twig_roots' option. Either one lets the parser work on only the relevant portions of the XML document at a time, which keeps memory use low as long as each processed portion is purged (or flushed) afterwards.
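To make the difference between the two options concrete, here is a minimal sketch that runs both against the same small document. The XML and its element names (catalog, meta, item, name) are made up for illustration and have nothing to do with the file from the question.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample document; the element names are invented for this demo.
my $xml = <<'XML';
<catalog>
  <meta><generated>2020-01-01</generated></meta>
  <item><name>alpha</name></item>
  <item><name>beta</name></item>
</catalog>
XML

# twig_handlers: the whole document goes through the tree builder, and the
# handler fires each time an <item> has been completely parsed.
my @seen_by_handlers;
XML::Twig->new(
    twig_handlers => {
        item => sub {
            my ( $twig, $item ) = @_;
            push @seen_by_handlers, $item->first_child_text('name');
            $twig->purge;    # discard what has been parsed so far
        },
    },
)->parse($xml);

# twig_roots: only <item> subtrees are built; <meta> never enters the tree.
my @seen_by_roots;
XML::Twig->new(
    twig_roots => {
        item => sub {
            my ( $twig, $item ) = @_;
            push @seen_by_roots, $item->first_child_text('name');
            $twig->purge;
        },
    },
)->parse($xml);

print "handlers: @seen_by_handlers\n";
print "roots:    @seen_by_roots\n";
```

Both variants see the same two items here. The practical difference is that with 'twig_roots' the parts of the document outside the listed elements are skipped instead of being built and then thrown away.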
It's difficult to say without seeing the XML whether processing the document chunk-by-chunk or just selected parts is the way to go, but either one should solve this issue.
So the code should look something like the following (chunk-by-chunk demo):
    use strict;
    use warnings;
    use XML::Twig;
    use List::Util 'sum';     # To make life easier
    use Data::Dump 'dump';    # To see what's going on

    my %bedrooms;             # Data structure to store the wanted info

    my $xml = XML::Twig->new(
        twig_roots => {
            DivisionHouseRoom => \&count_bedrooms,
        }
    );

    $xml->parsefile('divisionhouserooms-v3.xml');

    sub count_bedrooms {
        my ( $twig, $element ) = @_;

        my @divParents = $element->children('Divisions');
        my $id         = $element->first_child_text('HouseCode');

        for my $divParent (@divParents) {
            my @divisions = $divParent->children('Division');
            my $total     = sum map { $_->text } @divisions;
            $bedrooms{$id} = $total;
        }

        $element->purge;      # Free up memory
    }

    dump \%bedrooms;
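Since the actual XML wasn't posted, here is a small self-contained sketch for trying the handler out. The sample document is an assumption: its layout is only inferred from the element names the handler looks up (DivisionHouseRoom, HouseCode, Divisions, Division), and the root element name and the values are invented.

```perl
use strict;
use warnings;
use XML::Twig;
use List::Util 'sum';

# Hypothetical input; the real file's structure may differ.
my $sample = <<'XML';
<DivisionHouseRooms>
  <DivisionHouseRoom>
    <HouseCode>H1</HouseCode>
    <Divisions>
      <Division>2</Division>
      <Division>1</Division>
    </Divisions>
  </DivisionHouseRoom>
  <DivisionHouseRoom>
    <HouseCode>H2</HouseCode>
    <Divisions>
      <Division>4</Division>
    </Divisions>
  </DivisionHouseRoom>
</DivisionHouseRooms>
XML

my %bedrooms;

XML::Twig->new(
    twig_roots => {
        DivisionHouseRoom => sub {
            my ( $twig, $element ) = @_;
            my $id = $element->first_child_text('HouseCode');
            for my $divParent ( $element->children('Divisions') ) {
                # Assumes each <Division> holds a plain number
                $bedrooms{$id} = sum map { $_->text } $divParent->children('Division');
            }
            $twig->purge;    # Drop the record that was just processed
        },
    },
)->parse($sample);

printf "%s => %d\n", $_, $bedrooms{$_} for sort keys %bedrooms;
```

With this sample input the hash ends up with H1 => 3 and H2 => 4. Only one DivisionHouseRoom record is held in memory at a time, so the same approach should scale to a file with many such records.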