This may be a basic question, but it's killing me.
Here is my code snippet:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $twig = new XML::Twig( twig_handlers => { TRADE => \&TRADE } );
$twig->parsefile('1510.xml');
$twig->set_pretty_print('indented');
$twig->print_to_file('out.xml');
sub TRADE {
    my ( $twig, $TRADE ) = @_;
    #added delete in place of cut
    $TRADE->cut($TRADE) unless
        $TRADE->att('origin') eq "COMPUTER";
}
This works as expected: it gives me all the TRADE elements whose 'origin' attribute equals 'COMPUTER'.
But I need to handle XML files of up to 1 GB. In that case the script fails with a 'segmentation error' because it consumes a huge amount of memory.
To resolve this, I am trying to use the 'purge' feature of XML::Twig,
so I modified the code to:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
my $twig = new XML::Twig( twig_handlers => { TRADE => \&TRADE } );
$twig->parsefile('1510.xml');
$twig->set_pretty_print('indented');
$twig->print_to_file('out.xml');
sub TRADE {
    my ( $twig, $TRADE ) = @_;
    #added delete in place of cut
    $TRADE->cut($TRADE) unless
        $TRADE->att('origin') eq "COMPUTER";
    $twig->purge;
}
This gives me an empty file. I am trying to discard the twigs that have already been processed so that memory is used efficiently,
but I don't know why the output file is blank.
Sample XML :
<TRADEEXT>
<TRADE origin = 'COMPUTER'/>
<TRADE origin = 'COMP'/>
<TRADE origin = 'COMPP'/>
</TRADEEXT>
Expected output file:
<TRADEEXT>
<TRADE origin = 'COMPUTER'/>
</TRADEEXT>
You should probably use flush (to a filehandle) instead of purge: flush outputs the twig that has been parsed so far and frees the memory, while purge only frees the memory.
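As a rough sketch of the flush approach, assuming the file names are passed on the command line: the handler below deletes the TRADE elements you don't want and then flushes to the output filehandle, so nothing already processed stays in memory. The name filter_trades is just a helper made up for this example, not part of XML::Twig, and I left out pretty printing to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# filter_trades is a hypothetical helper name used only for this sketch.
sub filter_trades {
    my ( $in_file, $out_file ) = @_;

    open( my $out, '>:utf8', $out_file )
        or die "cannot create output file $out_file: $!";

    my $twig = XML::Twig->new(
        twig_handlers => {
            TRADE => sub {
                my ( $t, $trade ) = @_;
                my $origin = $trade->att('origin');

                # drop the element unless origin is exactly COMPUTER
                $trade->delete
                    unless defined $origin && $origin eq 'COMPUTER';

                # print what has been parsed so far to $out, then free it
                $t->flush($out);
            },
        },
    );

    $twig->parsefile($in_file);
    $twig->flush($out);    # print the remainder (the closing tags)
    close($out) or die "cannot close $out_file: $!";
    return;
}

# usage: perl filter_trades.pl 1510.xml out.xml
filter_trades(@ARGV) if @ARGV == 2;
```

The last flush after parsefile matters: without it the end of the document (here the closing TRADEEXT tag) would never be written.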
That said, if all you want is to remove the TRADE elements that don't have the proper attribute, you could do something like this:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;
open( my $out, '>:utf8', "out.xml") or die "cannot create output file out.xml: $!";
my $twig = XML::Twig->new(
    pretty_print => 'indented',
    twig_roots   => { 'TRADE[@origin != "COMPUTER"]' => sub { $_->delete; } },
    twig_print_outside_roots => $out,
)->parsefile('1510.xml');
This will leave some extra empty lines in the file; you can remove them later. The twig_roots handler is triggered for all the elements you need to remove, and it deletes them, while the twig_print_outside_roots option causes all other elements to be printed as is.
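If the extra empty lines bother you, a second pass along these lines should get rid of them. This is only a sketch: strip_blank_lines is a made-up helper name, and it assumes whitespace-only lines are never significant in your data. It reads line by line, so a 1 GB file is not a problem.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_blank_lines is a hypothetical helper: it copies $in_file to
# $out_file, skipping lines that contain only whitespace.
sub strip_blank_lines {
    my ( $in_file, $out_file ) = @_;

    open( my $in, '<:utf8', $in_file )
        or die "cannot read $in_file: $!";
    open( my $out, '>:utf8', $out_file )
        or die "cannot create $out_file: $!";

    while ( my $line = <$in> ) {
        next if $line =~ /^\s*$/;    # skip empty or whitespace-only lines
        print {$out} $line;
    }

    close($in);
    close($out) or die "cannot close $out_file: $!";
    return;
}

# usage: perl strip_blank_lines.pl out.xml out_clean.xml
strip_blank_lines(@ARGV) if @ARGV == 2;
```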