perlbioinformaticsbioperlfastqsequencing

Processing FASTQ files based on mate pair length


The following files are two mates of a paired-end fastq file, I want to separate each fastq based on their length.

mate1.fq:

@SRR127.1
TGGTTATGATGTTTGTGTAGGAATAGAAATTTTGATTAAGATATTAGTGAAATTTGAATGTAGTTTATTTGGAAGTTATGGAGAGTTTATATTGTATTTATGTTTATTGTTGTAGATTTATATTTATGTGTATATATTAGTTTTTTTGTGT
+
ABAAAF4FFFFFGGGGGGFFGGFGHGFGHHHHHGGCFFGHHHHH5FDBED55DGGFEGFHHHGBHDDHHHFF3AB3FFG5CBGBEF5BD5DGFEGHFAGAFEDGHGFHHGHGEFFGFGGHFEGHHFHGBEBGHHHHGHBHHFHHGGFGHH2
@SRR127.2
TATGGTAAGAAAATTGAAAATTATAAAAAATGAAAAATGTTTATTTGATGATTTGAAAAATGATGAAATTATTGAAAAATGTGAAAAATGAGAAATGTATATTGTAGGATTTGGAATATGGTGAGATAAATGAAAATTATAGTAAATG
+
AABAA5@D4@5CFFCA55FFGGHDGFHFFCC45DGFA2FA5DD55AAAA55DDBDEDDBGGFF5BA5DDABF5D5B5FF1ADFB5EDGHFG5@BFBD55D5FFB@@5@GBGEFBGHHGB@DBBFHFBDG3B43FFH@FGFHH?FHHHH

mate2.fq:

@SRR127.1
ACCTATAAAAAAACCATATCAATAACTATAAAATCTTTATAAAATCCCACCCAATTAAAAAAAAATAAATTAATACATATAAAACCTTAAACACATAAAACATAATCACATACTATATAAACAATTACTATCACTACTAAACACCTAATA
+
>AA?AF13B@D@1EFCGGGFFG3EBGHHHBB2FGHHGHGFDGHHDFEGFHGGGHG1FFF1GGCGGGBGHHHHHFHHHHFHEGGFHF0BD1FGHHAGEGHFHHHFGGFHHGHHHFHHGGFHBGHFED1FBGFGFHDGHGHFGG1GB0GFHH
@SRR127.2
CTATTTCTCATTTTTTTATAATTTTCAATTCTCTTACCATATTCCACATCCTACACTAAACATTTCTAAATTTTCCACCTTTTTCTATTTTTCTCACCATATTTCATATCCTAAAAAACATATTCCTCATTTACTATAATTTTCAATTATC
+
11>>AFFDFF3@FFF?EFFGFBGHFDFA33D2FF2GGHFE12DD221AF1F1E1BG1GGBFBGGEGHDAABGAGDFABGG1BBDF12A2@2BG@2@DEFFF2B2@2222BB2211FGEE/11@22B2>1B22F2>GBGBD22BGD2>2B22

I wrote the following code to do this but I get a strange error only for the second file (mate2.fq) while both of them also have 151 bp reads.

#!/usr/bin/perl

use strict;
use warnings;

my @fh;

my $file_name = $ARGV[0];
my $infile    = $ARGV[1];

#convert every 4-line fastq to 1-line
open(FH, "cat '$infile' | awk '{printf \"%s%s\",\$0,(NR%4?FS:RS)}' | ");

while (<FH>) {
  chomp;

  my @line = split(/\s+/, $_);
  my $len  = length($line[1]);

  if ($len >= 100) {

    #print $len,"\n",$_,"\n";
    push @fh, $len;

    if (not defined $fh[$len]) {
      open $fh[$len], '>', "$file_name\_$len";
    }
    print { $fh[$len] } (join("\n", @line), "\n");
  }

}

Error:

Can't use string ("151") as a symbol ref while "strict refs" in use at

How can I process these files?


Solution

  • As you have read, your problem is because of a spurious push that adds an integer value to the end of the @fh array. I presume you were aiming to extend the array to be long enough to add the new file handle. You can do that by assigning to $#fh, so you would write $#fh = $len if $#fh < $len; however it is unnecessary because Perl will extend arrays automatically for you when you simply assign to an element off the end of the array

    I have a couple of comments on your program that I hope you find useful

    This is how I would write your code. It accumulates the input records into $line until four records have been added, and then processes that line as before.

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    my ($file_name, $infile) = @ARGV;
    
    open my $in_fh, '<', $infile or die $!;
    my $line;
    
    my @fh;
    while ( <$in_fh> ) {
      chomp;
      $line .= $_;
    
      if ( $. % 4 == 0 or eof ) {
    
        my @line = split ' ', $line;
        my $len  = length $line[1];
        next if $len < 100;
    
        open $fh[$len], '>', "${file_name}_$len" unless $fh[$len];
        print { $fh[$len] } "$_\n" for @line;
    
        $line = undef;
      }
    }