
Sort nested hash with multiple conditions


I'm fairly new to Perl programming, and I have a nested hash that looks like this:

    $hash{"snake"}{ACB2}   = [70, 120];
    $hash{"snake"}{SGJK}   = [183, 120];
    $hash{"snake"}{KDMFS}  = [1213, 120];
    $hash{"snake"}{VCS2}   = [21, 120];
    ...
    $hash{"bear"}{ACB2}    = [12, 87];
    $hash{"bear"}{GASF}    = [131, 87];
    $hash{"bear"}{SDVS}    = [53, 87];
    ...
    $hash{"monkey"}{ACB2}  = [70, 230];
    $hash{"monkey"}{GMSD}  = [234, 230];
    $hash{"monkey"}{GJAS}  = [521, 230];
    $hash{"monkey"}{ASDA}  = [134, 230];
    $hash{"monkey"}{ASMD}  = [700, 230];

The structure of the hash is in summary:

    $hash{Organism}{ProteinID} = [protein_length, total_of_proteins_in_that_organism]

I would like to sort this hash according to some conditions. First, I only want to consider organisms whose total number of proteins is higher than 100. Then, for each of those organisms, I would like to show its name along with its largest protein and that protein's length.

For this, I'm going for the following approach:

    foreach my $org (sort keys %hash) {
        foreach my $prot (keys %{ $hash{$org} }) {
            if ($hash{$org}{$prot}[1] > 100) {
                @sortedarray = sort {$hash{$b}[0]<=>$hash{$a}[0]} keys %hash;

                print $org."\n";
                print @sortedarray[-1]."\n";
                print $hash{$org}{$sortedarray[-1]}[0]."\n"; 
            }
        }
    }

However, this prints the name of the organism once for every protein in it; for instance, it prints "snake" 120 times. Besides, the sorting is not working properly, because I guess I should be using the variables $org and $prot in the sort line.

Finally, the output should look like this:

snake
"Largest protein": KDMFS [1213]

monkey
"Largest protein": ASMD [700]

Solution

  • All data sorted in print

    use warnings;
    use strict;
    use feature 'say';
    
    use List::Util qw(max);
    
    my %hash;   
    $hash{"snake"}{ACB2}   = [70, 120];
    $hash{"snake"}{SGJK}   = [183, 120];
    $hash{"snake"}{KDMFS}   = [1213, 120];
    $hash{"snake"}{VCS2}   = [21, 120];
    $hash{"bear"}{ACB2}   = [12, 87];
    $hash{"bear"}{GASF}   = [131, 87];
    $hash{"bear"}{SDVS}   = [53, 87];    
    $hash{"monkey"}{ACB2}   = [70, 230];
    $hash{"monkey"}{GMSD}   = [234, 230];
    $hash{"monkey"}{GJAS}   = [521, 230];
    $hash{"monkey"}{ASDA}   = [134, 230];
    $hash{"monkey"}{ASMD}   = [700, 230];
    
    my @top_level_keys_sorted = 
        sort {   
            ( max map { $hash{$b}{$_}->[0] } keys %{$hash{$b}} ) <=> 
            ( max map { $hash{$a}{$_}->[0] } keys %{$hash{$a}} )
        }   
        keys %hash;
    
    for my $k (@top_level_keys_sorted) {
        say $k; 
        say "\t$_ --> @{$hash{$k}{$_}}" for 
            sort { $hash{$k}{$b}->[0] <=> $hash{$k}{$a}->[0] } 
            keys %{$hash{$k}};
    }
    

    This first sorts the top-level keys (organisms) by the largest protein length found under each key, that is, the maximum of the first number in the arrayref values, in descending order. With that sorted list of keys on hand we then go inside each key's hashref and sort its proteins by length, also descending. That loop is what we'd tweak to limit output as wanted (only organisms whose total exceeds 100, only the largest protein by length, etc).

    It prints

    snake
            KDMFS --> 1213 120
            SGJK --> 183 120
            ACB2 --> 70 120
            VCS2 --> 21 120
    monkey
            ASMD --> 700 230
            GJAS --> 521 230
            GMSD --> 234 230
            ASDA --> 134 230
            ACB2 --> 70 230
    bear
            GASF --> 131 87
            SDVS --> 53 87
            ACB2 --> 12 87
    

    I can't tell whether the output should show everything for "organisms with a total number of proteins higher than 100" (text) or only the largest protein of each (desired output), so I am leaving all of it. Cut it off as needed. To get only the largest one, either compare the max from each key in the loop or see this post (same problem).
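One way to trim the output down to what the question asks for is sketched below. This is only an illustration, not the answer's own code: the helper name `largest_per_organism` and its `$min_total` parameter are made up for this example, and it assumes the total is stored identically in every protein record of an organism, as in the sample data.

```perl
use strict;
use warnings;
use feature 'say';

# Sample data from the question: [protein_length, total_proteins_in_organism]
my %hash = (
    snake  => { ACB2 => [70, 120], SGJK => [183, 120], KDMFS => [1213, 120], VCS2 => [21, 120] },
    bear   => { ACB2 => [12, 87],  GASF => [131, 87],  SDVS  => [53, 87] },
    monkey => { ACB2 => [70, 230], GMSD => [234, 230], GJAS  => [521, 230],
                ASDA => [134, 230], ASMD => [700, 230] },
);

# Hypothetical helper (not part of the original answer): returns a list of
# [organism, protein_id, length] rows, one per organism whose total exceeds
# $min_total, ordered by that largest length in descending order.
sub largest_per_organism {
    my ($data, $min_total) = @_;
    my @rows;
    for my $org (keys %$data) {
        my $prots = $data->{$org};

        # Assumption: the total is repeated in every record, so any one will do
        my ($any) = values %$prots;
        next unless defined $any && $any->[1] > $min_total;

        # Sort protein IDs by length, descending, and keep only the first
        my ($top) = sort { $prots->{$b}[0] <=> $prots->{$a}[0] } keys %$prots;
        push @rows, [ $org, $top, $prots->{$top}[0] ];
    }
    # Organism name breaks ties so the order is deterministic
    return sort { $b->[2] <=> $a->[2] or $a->[0] cmp $b->[0] } @rows;
}

for my $row (largest_per_organism(\%hash, 100)) {
    my ($org, $prot, $len) = @$row;
    say $org;
    printf qq{"Largest protein": %s [%d]\n\n}, $prot, $len;
}
```

With the sample data this lists snake (KDMFS, 1213) and then monkey (ASMD, 700), and skips bear because its total of 87 is not above 100. The threshold check runs once per organism rather than once per protein, which is what avoids the repeated organism names seen in the question's attempt.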

    Note that a hash itself cannot be "sorted" as it is inherently unordered. But we can print things out sorted, as above, or generate ancillary data structures which can be sorted, if needed.
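As a rough sketch of such an ancillary structure, the nested hash can be flattened into an array of records, which can then be sorted by whatever criteria are needed. The field names `org`, `prot`, `len` and `total` are chosen just for this example, and the sort order (total, then length, then protein ID) is only one possible choice.

```perl
use strict;
use warnings;
use feature 'say';

# Sample data from the question: [protein_length, total_proteins_in_organism]
my %hash = (
    snake  => { ACB2 => [70, 120], SGJK => [183, 120], KDMFS => [1213, 120], VCS2 => [21, 120] },
    bear   => { ACB2 => [12, 87],  GASF => [131, 87],  SDVS  => [53, 87] },
    monkey => { ACB2 => [70, 230], GMSD => [234, 230], GJAS  => [521, 230],
                ASDA => [134, 230], ASMD => [700, 230] },
);

# Flatten the two-level hash into a list of records (one hashref per protein).
# Unlike the hash, this array has a definite order once sorted.
my @records;
for my $org (keys %hash) {
    for my $prot (keys %{ $hash{$org} }) {
        my ($len, $total) = @{ $hash{$org}{$prot} };
        push @records, { org => $org, prot => $prot, len => $len, total => $total };
    }
}

# Sort by organism total (descending), then protein length (descending),
# then protein ID (ascending) as a final tie-breaker
my @sorted = sort {
    $b->{total} <=> $a->{total}
        or $b->{len} <=> $a->{len}
        or $a->{prot} cmp $b->{prot}
} @records;

# Print one tab-separated line per record using a hash slice
say join "\t", @{$_}{qw(org prot len total)} for @sorted;
```

Once the data is in `@sorted`, filtering is a plain `grep`, for example `grep { $_->{total} > 100 } @sorted` to keep only proteins from organisms above the threshold.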