Purge XML Twig inside sub handler

I am parsing large XML files (60GB+) with XML::Twig in an OO (Moose) script. I use the twig_handlers option to process elements as soon as they have been read into memory. However, I'm not sure how to get at both the element and the full twig inside the handler.

Before I used Moose (and OO altogether), my script looked as follows (and worked):

my $twig = XML::Twig->new(
  twig_handlers => {
    $outer_tag => \&_process_tree,
  }
);
$twig->parsefile($input_file);


sub _process_tree {
  my ($fulltwig, $twig) = @_;

  $twig->cut;
  $fulltwig->purge;
  # Do stuff with twig
}

Now, with Moose, I do it like this:

my $twig = XML::Twig->new(
  twig_handlers => {
    $self->outer_tag => sub {
      $self->_process_tree($_);
    }
  }
);
$twig->parsefile($self->input_file);

sub _process_tree {
  my ($self, $twig) = @_;

  $twig->cut;
  # Do stuff with twig
  # But now the 'full twig' is not purged
}

The problem is that I am no longer purging the full twig. In the first, non-OO version, I assumed purging would save memory by getting rid of the full twig as soon as possible. With the OO version (which has to rely on an explicit sub {} inside the handler), I don't see how I can purge the full twig, because the documentation says:

$_ is also set to the element, so it is easy to write inline handlers like

para => sub { $_->set_tag( 'p'); }

So the documentation mentions the element you want to process, but not the full twig itself. How can I purge it if it is not passed to the subroutine?

Shoemaker answered 23/7, 2017 at 9:29

The handler still gets the full twig, you're just not using it (using $_ instead).
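To illustrate the point, here is a minimal sketch (not taken from the question's script) showing that an inline handler still receives the twig as its first argument even when it works with $_. The sample XML string, the `item` tag, and the `@seen` and `$first_is_twig` variables are assumptions made up for the demonstration.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical stand-in for the large input file.
my $xml = '<root><item id="1"/><item id="2"/><item id="3"/></root>';

my @seen;              # ids collected by the handler
my $first_is_twig = 1; # stays true if $_[0] is always the twig

my $twig = XML::Twig->new(
  twig_handlers => {
    item => sub {
      # @_ is (twig, element); $_ is also set to the element.
      $first_is_twig &&= $_[0]->isa('XML::Twig');
      push @seen, $_->att('id');
      $_[0]->purge;    # release what has been parsed so far
    },
  },
);
$twig->parse($xml);

print join(',', @seen), "\n";
```

The handler reads the attribute before purging, since purged elements are no longer usable.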

As it turns out, you can still call purge on the twig (which I usually call "element", or elt in the docs): $_->purge will work as expected, purging the full twig up to the current element in $_.

A cleaner (IMHO) way would be to actually get all of the parameters and purge the full twig explicitly:

my $twig = XML::Twig->new(
  twig_handlers => {
    $self->outer_tag => sub {
      $self->_process_tree(@_); # pass _all_ of the arguments
    }
  }
);
$twig->parsefile($self->input_file);

sub _process_tree {
  my ($self, $full_twig, $twig) = @_; # now you see them!

  $twig->cut;
  # Do stuff with twig
  $full_twig->purge;  # now you don't
}
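For anyone who wants to try this pattern without a 60GB file, the following is a self-contained sketch under a few assumptions: it uses a plain blessed hash instead of Moose, and the `Importer` class, its `run` and `ids` methods, and the inline XML are hypothetical names invented for the example. Only the handler wiring and the cut/purge calls mirror the code above.

```perl
package Importer;    # hypothetical class; plain Perl object, not Moose
use strict;
use warnings;
use XML::Twig;

sub new {
  my ($class, %args) = @_;
  return bless { outer_tag => $args{outer_tag}, ids => [] }, $class;
}

sub outer_tag { $_[0]{outer_tag} }
sub ids       { $_[0]{ids} }

sub run {
  my ($self, $xml) = @_;
  my $twig = XML::Twig->new(
    twig_handlers => {
      # Forward every handler argument: (twig, element)
      $self->outer_tag() => sub { $self->_process_tree(@_) },
    },
  );
  $twig->parse($xml);
  return $self;
}

sub _process_tree {
  my ($self, $full_twig, $elt) = @_;

  $elt->cut;                               # detach the subtree
  push @{ $self->ids }, $elt->att('id');   # "do stuff" with it
  $full_twig->purge;                       # free memory before the next subtree
}

package main;

my $importer = Importer->new(outer_tag => 'item')
                       ->run('<root><item id="a"/><item id="b"/></root>');
print join(',', @{ $importer->ids }), "\n";
```

With a real file, `parse` would be replaced by `parsefile`, as in the code above.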
Arachnoid answered 23/7, 2017 at 10:32 Comment(2)
Aah, my bad! I should've inspected @_ to see what was going on. Thanks! Is there any downside or upside to purging the full twig only after you have done stuff with the cut twig? My reasoning was to purge it immediately after cutting the element, so that memory is freed as soon as possible. I might be wrong? Great module by the way, we use it all the time! - Shoemaker
It should make no difference when you purge. The most important thing is to reclaim the memory before you start parsing the next subtree. And thanks ;--) - Arachnoid