
hash of arrays to create unique ids


I want to create unique IDs for a file of gene transcripts. Each row consists of a transcript_id and intron coordinates in the format chromosome:start_coord-end_coord:strand. My file looks like this:

CUFF.59321      chr7:134136506-134143748:-
CUFF.59321      chr7:134135655-134136337:-
CUFF.59321      chr7:134134550-134135537:-
CUFF.59321      chr7:134133872-134134471:-
CUFF.59321      chr7:134133246-134133748:-
CUFF.59321      chr7:134132814-134133138:-
CUFF.57276      chr7:25163747-25164818:-
CUFF.57276      chr7:25163469-25163569:-

I want to merge the rows that share a transcript_id (column 1) and join their start-end coordinates into a single entry. Example for CUFF.57276:

CUFF.57276 chr7:25163747-25164818:25163469-25163569:-

For this purpose I used a hash of arrays.

#!/usr/bin/perl -w

use strict;

my $input_gtf = shift @ARGV or die $!;

my %hash;

open (FILE, "$input_gtf") or die $!;
while (<FILE>) {
    my $line = $_;
    chomp $line;
    my @array = split /:\s+/, $line;
    my $cuff = $array[0];
    my @introns = $array[1];
    $hash{$cuff} = [@introns];
}
foreach my $cuff(keys %hash) {
    print "$cuff:${hash{$cuff}}\n";
}

close FILE;

However, I got the following output:

CUFF.61092      chr8:67968840-67969614:-:ARRAY(0x16a8b10)
CUFF.30258      chr19:16636489-16638890:-:ARRAY(0x15f3b00)
CUFF.47340      chr4:85719262-85722802:-:ARRAY(0x2ae38599de90)

How can I print the values stored behind ARRAY(0x16a8b10) and similar entries?


Solution

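  • There's no whitespace after : in the input, so the pattern /:\s+/ never matches: split returns the whole line as a single field and $array[1] is undefined. Also, you don't want to overwrite $hash{$cuff} for each line, you want to push the new range onto the existing array. Finally, $hash{$cuff} holds an array reference, and printing a reference directly gives ARRAY(0x...). @{ ... } is the array dereference, which turns an array reference into the array it refers to.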
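The three points above can be checked in isolation. The snippet below is only a small sketch: the sample line and key are copied from the question, and the variable names @fields and $joined are made up for the illustration.

```perl
use strict;
use warnings;

my $line = "CUFF.57276      chr7:25163747-25164818:-";

# /:\s+/ needs whitespace right after a colon, which this line never has,
# so split returns the whole line as a single field.
my @fields = split /:\s+/, $line;
print scalar(@fields), "\n";    # prints 1

my %hash;
# Pushing through @{ ... } creates the array on first use (autovivification)
# and appends to it afterwards instead of overwriting it.
push @{ $hash{'CUFF.57276'} }, '25163747-25164818';
push @{ $hash{'CUFF.57276'} }, '25163469-25163569';

# Interpolating the reference itself prints something like ARRAY(0x...).
print "$hash{'CUFF.57276'}\n";

# Dereferencing yields the stored ranges.
my $joined = join ':', @{ $hash{'CUFF.57276'} };
print "$joined\n";    # prints 25163747-25164818:25163469-25163569
```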

    Here's my version of your script:

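    #!/usr/bin/perl
    use warnings;
    use strict;
    
    # Take the input file name from the command line.
    my $input_gtf = shift or die "Usage: $0 <input_file>\n";
    
    my %hash;
    
    # Three-argument open with a lexical filehandle.
    open my $FILE, '<', $input_gtf or die "Cannot open $input_gtf: $!";
    while (my $line = <$FILE>) {
        chomp $line;
        # Fields: "transcript_id<whitespace>chromosome", "start-end", strand.
        my @array = split /:/, $line;
        # Only the first two fields are used; the strand is not stored.
        my ($cuff, $introns) = @array;
        # Append the range to this key's array (created on first use).
        push @{ $hash{$cuff} }, $introns;
    }
    close $FILE;
    
    # Print each key followed by its ranges, colon-separated.
    for my $cuff (keys %hash) {
        print join ':', $cuff, @{ $hash{$cuff} };
        print "\n";
    }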
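The target line in the question also ends with the strand, which a plain split on : leaves in an unused third field. If that exact layout is needed, one possible variant is sketched below. It assumes the two columns are separated by whitespace and that every row of a transcript shares one chromosome and strand. The helper name merge_introns is hypothetical, and input order is preserved only for readability.

```perl
use strict;
use warnings;

# Hypothetical helper: groups intron ranges by transcript, chromosome and strand.
sub merge_introns {
    my @lines = @_;
    my (%ranges, @order);
    for my $line (@lines) {
        chomp $line;
        # Column 1 is the transcript id, column 2 is chromosome:start-end:strand.
        my ($cuff, $coords) = split ' ', $line;
        next unless defined $coords;            # skip blank or short lines
        my ($chr, $range, $strand) = split /:/, $coords;
        next unless defined $strand;            # skip malformed coordinates
        my $key = join "\t", $cuff, $chr, $strand;
        push @order, $key unless exists $ranges{$key};
        push @{ $ranges{$key} }, $range;
    }
    # Rebuild one line per key: id, chromosome, all ranges, then strand.
    return map {
        my ($cuff, $chr, $strand) = split /\t/, $_;
        "$cuff " . join(':', $chr, @{ $ranges{$_} }, $strand);
    } @order;
}

print "$_\n" for merge_introns(
    "CUFF.57276      chr7:25163747-25164818:-\n",
    "CUFF.57276      chr7:25163469-25163569:-\n",
);
# prints CUFF.57276 chr7:25163747-25164818:25163469-25163569:-
```

To run it on a file, the lines read from the filehandle can be passed in directly, for example merge_introns(<$FILE>).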
    

    Unrelated changes I made to the code: