perl, perl-hash

How to access a nested hash of arrays in a loop?


I have data in this format:

a1 1901 4
a1 1902 5
a3 1902 6
a4 1902 7
a4 1903 8
a5 1903 9

I want to calculate the cumulative score (3rd column) for each entity in the first column. So I tried to make a hash and my code looks like this:

use strict;
use warnings;

use Data::Dumper;

my $file = shift;
open(DATA, '<', $file) or die "Cannot open '$file': $!";

my %hash;
while ( my $line = <DATA> ) {
  chomp $line;
  my ($protein, $year, $score) = split /\s+/, $line;
  push @{ $hash{$protein}{$year} }, $score;
}

print Dumper \%hash;

close DATA;

The output looks like this:

$VAR1 = {
          'a3' => {
                    '1902' => [
                                6
                              ]
                  },
          'a1' => {
                    '1902' => [
                                5
                              ],
                    '1901' => [
                                4
                              ]
                  },
          'a4' => {
                    '1903' => [
                                8
                              ],
                    '1902' => [
                                7
                              ]
                  },
          'a5' => {
                    '1903' => [
                                9
                              ]
                  }
        };

I now want to access each entity in column 1 (a1, a3, a4, ...) and keep a running total of its scores, so the desired output would be something like this:

a1 1901 4
a1 1902 9    # 4+5
a3 1902 6
a4 1902 7
a4 1903 15   # 7+8
a5 1903 9

But I can't work out how to loop over the values of this hash in order to add them up.


Solution

  • If the data is always sorted as you show it, then you can keep a running total and print each line as you read it from the file:

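    # Declare the state variables so the loop runs under strict and warnings
    my ( $current, $total ) = ( '', 0 );

    while ( <DATA> ) {
        my ($protein, $year, $score) = split;

        # Reset the running total whenever a new protein starts
        $total = 0 unless $protein eq $current;
        $total += $score;

        print "$protein $year $total\n";

        $current = $protein;
    }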
    

    output

    a1 1901 4
    a1 1902 9
    a3 1902 6
    a4 1902 7
    a4 1903 15
    a5 1903 9
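
If the input is not guaranteed to be sorted, the nested hash from the question can be walked with two nested loops instead. The following is only a sketch under some assumptions: `cumulative_scores` is a hypothetical helper name (not from the original post), years are taken to be numeric, and proteins are listed in plain string order, so `a10` would sort before `a2`.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to the nested hash built in the
# question ($hash{protein}{year} = [scores]) and returns a list of
# "protein year running_total" strings.
sub cumulative_scores {
    my ($hash) = @_;
    my @lines;

    # Outer loop: each protein, in string order
    for my $protein ( sort keys %$hash ) {
        my $total = 0;

        # Inner loop: that protein's years, in numeric order
        for my $year ( sort { $a <=> $b } keys %{ $hash->{$protein} } ) {

            # Dereference the array of scores and add each one to the total
            $total += $_ for @{ $hash->{$protein}{$year} };
            push @lines, "$protein $year $total";
        }
    }
    return @lines;
}

# Sample data from the question, already in nested form
my %hash = (
    a1 => { 1901 => [4], 1902 => [5] },
    a3 => { 1902 => [6] },
    a4 => { 1902 => [7], 1903 => [8] },
    a5 => { 1903 => [9] },
);

print "$_\n" for cumulative_scores( \%hash );
```

The key points are `keys %{ $hash->{$protein} }` to get the years of one protein and `@{ $hash->{$protein}{$year} }` to get the list of scores stored under one year. With the sample data this prints the same six lines as the output above.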