xml perl grep xml-twig

How to parse an XML file with Perl based on an attribute, using grep


I'm new to Perl and have been struggling. I have an XML file with the following structure, but with thousands of entries:

test.xml

<msms_pipeline_analysis>
    <spectrum_query spectrum="H_TPP08.04885.04885.2" start_scan="4885" end_scan="48887">
        <search_result>
          <search_hit calc_neutral_pep_mass="2348.060995306391" hit_rank="1">
          </search_hit>
        </search_result>
    </spectrum_query>
    <spectrum_query spectrum="L_TPP08.05765.04785.2" start_scan="4885" end_scan="48856">
        <search_result>
          <search_hit calc_neutral_pep_mass="2348.060995306391" hit_rank="1">
          </search_hit>
        </search_result>        
    </spectrum_query>
    <spectrum_query spectrum="L_TPP10.87945.3485.2" start_scan="4885" end_scan="4885">
        <search_result>
          <search_hit calc_neutral_pep_mass="2348.060995306391" hit_rank="1">
          </search_hit>
        </search_result>        
    </spectrum_query>
</msms_pipeline_analysis>

I need to delete the "spectrum_query" nodes whose "spectrum" attribute does not contain a given string, "TPP08" in this example. The string is whatever sits between the first underscore and the first dot (later I would like to subset on TPP09, TPP10, etc.), e.g.

H_TPP08.04885.04885.2

and keep the rest of the file with its structure intact.

By searching I have found many solutions that delete nodes matching a given attribute value. In my case, such a solution can delete the node in question:

#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

my $twig = XML::Twig -> new ('pretty_print' => 'indented' ) -> parsefile ( 'test.xml' ); 
foreach my $element ( $twig -> get_xpath('spectrum_query[@spectrum="H_TPP08.04885.04885.2"]') ) {
   $element -> delete;
}

$twig -> print; 

open my $xml_out, '>', 'output.xml' or die "Cannot open output.xml: $!";
print {$xml_out} $twig->sprint;
close $xml_out;

This deletes the first node, but only that specific one, and the real file has thousands of entries. Moreover, I want to keep the nodes that meet the criterion rather than delete them; otherwise I would have to run the script once for every other value that is not TPP08 (e.g. TPP09, TPP10, etc.).

To extract the string, so far I have come up with this:

my $string = 'H_TPP08.05164.05164.2';
# capture what sits between the first underscore and the first dot
my ($substring2) = $string =~ /^[^_]*_([^.]+)\./;
print "$substring2\n";

This outputs TPP08, which is what I want, as I need to keep the nodes with H_TPP08.XXXX or L_TPP08.XXXX.

So far I have not found whether there is a way to do a negative match, like "!" with grep in R, and to apply that match to the attribute value so the file can be filtered. From what I have read, I would most likely need to build an array with the attribute value of every entry:

my @array = map { $_->att('spectrum') } $twig->get_xpath('//spectrum_query');

and then test each entry against the matching string with grep, keeping only the nodes that satisfy it. But I cannot wrap my head around a solution for that with my basic Perl knowledge.

Any help will be really appreciated! Thanks


Solution

  • The most "twiggish" way to do this would be to go through the file and discard the elements you don't want while outputting the rest.

    This will be very memory efficient, since pretty much nothing will be kept in memory.

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    use autodie qw(open);
    
    use XML::Twig;
    
    my $target = 'TPP08';
    my $input  = 'test.xml';
    my $output = 'output.xml';
    open( my $out, '>:utf8', $output);
    
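    # Twig roots are built but never printed; everything outside them is
    # printed. So the roots must match the elements to discard, hence !~.
    # The dot is written \\. because qq{} would turn \. into a plain dot.
    XML::Twig->new( twig_roots               => { qq{spectrum_query[\@spectrum!~/^[^_]*_$target\\./]} => 1, },
                    twig_print_outside_roots => $out,
                  )
             ->parsefile( $input);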
    

    Note that each discarded element leaves an empty line in the output; whitespace management is tricky. If that matters, you can remove the blank lines with grep -v or by running the output through xml_pp.
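
If the file fits in memory, the negative match the question asks about can also be written directly with Perl's grep and the !~ operator. The following is only a minimal sketch under that assumption: filter_spectra is a hypothetical helper name, and the sample XML is a trimmed-down version of the structure in the question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper (not part of XML::Twig): returns the document as a
# string, keeping only the spectrum_query elements whose spectrum attribute
# has $target between the first underscore and the first dot.
sub filter_spectra {
    my ($xml, $target) = @_;

    my $twig = XML::Twig->new( pretty_print => 'indented' );
    $twig->parse($xml);

    # grep with a negated match (!~) selects the elements to drop;
    # \Q...\E protects any regex metacharacters in $target
    my @unwanted = grep { ( $_->att('spectrum') // '' ) !~ /^[^_]*_\Q$target\E\./ }
                   $twig->root->children('spectrum_query');

    $_->delete for @unwanted;

    return $twig->sprint;
}

# Trimmed-down sample data, assumed to mirror test.xml
my $xml = <<'XML';
<msms_pipeline_analysis>
  <spectrum_query spectrum="H_TPP08.04885.04885.2"/>
  <spectrum_query spectrum="L_TPP08.05765.04785.2"/>
  <spectrum_query spectrum="L_TPP10.87945.3485.2"/>
</msms_pipeline_analysis>
XML

print filter_spectra($xml, 'TPP08');
```

For a real file, parsefile('test.xml') and print_to_file('output.xml') could replace parse and sprint. Unlike the twig_roots approach, this builds the whole tree first, so it trades memory for simplicity.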