xmlperlxlsxxml-twig

Extracting data from Excel XML files using perl XML::Twig handlers when tags (nodes) contain prefix


I use XML::Twig handlers/roots to extract information from large XML files where it costs too much to load the entire file into memory. These XML files are the internal sheet files of Excel .xlsx files.

This method has been working very well up until now. Below is an example of extracting all the cell references from the internal XML file sheet1.xml.

use strict;
use warnings;
use Archive::Zip qw(:ERROR_CODES :CONSTANTS);
use XML::Twig;
use Data::Dumper;

my $zipName='TestFile.xlsx';
my $zip = Archive::Zip->new();
my $zipread;
$zipread=$zip->read($zipName);

my $tw1=new XML::Twig();
my $fileToAnalyse='xl/worksheets/sheet1.xml';
my $sheetFile = $zip->contents($fileToAnalyse);
    
    
my @Results;
my $t= XML::Twig->new(twig_roots => {'worksheet/sheetData/row/c' => 
                  sub { Get_Sheet_Data_TEST_1(@_,\@Results);}})->parse($sheetFile);
print Dumper \@Results;

sub Get_Sheet_Data_TEST_1{
    my($t,$elt,$Results)= @_;

    my @attrib_NAMES=$elt->att_names();
    for my $attrib_loop (0 .. scalar @attrib_NAMES-1){
        if($attrib_NAMES[$attrib_loop] eq 'r'){
            push @$Results,$elt->att($attrib_NAMES[$attrib_loop]);
        }
    }   
    $t->purge; # frees the memory
}

Sometimes these files have prefixes for the tags I am looking for

So

'worksheet/sheetData/row/c'

becomes

'x:worksheet/x:sheetData/x:row/x:c'

Now my handler never fires as it does not find the desired tag.

Is there any way my handlers can be modified, without hardcoding all the possible prefix possibilities, so these prefixes can be matched as well as the "usual" tags with no prefixes?

Perhaps there is a way to find which prefixes are used by any given file in advance and set these values to a variable that I can then pass to my handler.


Solution

  • OK, I've found a solution. Turns out that XML::Twig has an optional argument

    map_xmlns
    

    I can use this to fix my problem. So, my original code

    my $t= XML::Twig->new(twig_roots => {'worksheet/sheetData/row/c' => 
             sub { Get_Sheet_Data_TEST_1(@_,\@Results);}})->parse($sheetFile);
    

    becomes

    my $t= XML::Twig->new(
    map_xmlns => {
            'http://schemas.openxmlformats.org/spreadsheetml/2006/main' => 's'},
    twig_roots => {'s:worksheet/s:sheetData/s:row/s:c' => 
           sub { Get_Sheet_Data_TEST_1(@_,\@Results);}})->parse($sheetFile);
    

    Now my handler works for all prefixes (even empty ones!).

    As written in the XML::Twig documentation:

    map_xmlns

    This option is passed a hashref that maps uri's to prefixes. The prefixes in the document will be replaced by the ones in the map. The mapped prefixes can (actually have to) be used to trigger handlers, navigate or query the document.