bioinformaticsfastq

sort fastq file and keep sequences 15-17 bp in length


I have a couple very large fastq files that I am using cutadapt to trim off a transposon end sequence from, this should result in 15-17 base pairs of genomic DNA remaining. A very large portion of the fastq files is 15-17 base pairs after using cutadapt, but some sequences are quite a bit longer (indicating they didn't have a transposon end sequence on them and they are garbage reads for my experiment).

My question: is there a command or script I can utilize in Linux in order for me to sort through these fastq files and output a new fastq containing only reads that are 15-17 base pairs long, while still retaining the usual fastq format?

For reference, the fastq format looks like this:

@D64TDFP1:287:C69APACXX:2:1101:1319:2224 1:N:0:
GTTAGACCGGATCCTAACAGGTTGGATGATAAGTCCCCGGTCTAT
+
DDHHHDHHGIHIIIIE?FFHECGHICHHGH>BD?GHIIIIFHIDG
@D64TDFP1:287:C69APACXX:2:1101:1761:2218 1:N:0:
GTTAGACCGGATCCTAACAGGTTGGATGATAAGTCCCCGGTCTAT
+
FFHHHHHJIJJJJJIIJJJIJHIJJGIJIIIFJ?HHJJJJGHIGI

I found a similar question here, but it appears that a correct solution was never found. Does anyone have any solutions?


Solution

  • Read four lines at a time into an array. Print out those four lines, when the read length is between your thresholds.

    Here is an example of how to do that with Perl, but the principle would be the same in Python or any other scripting language:

    #!/usr/bin/env perl
    
    use strict;
    use warnings;
    
    my $fastq;
    my $lineIdx = 0;
    while (<>) {
        chomp;
        $fastq->[$lineIdx++] = $_;
        if ($lineIdx == 4) {
            my $readLength = length $fastq->[1];
            if (($readLength >= 15) && ($readLength <= 17)) {
                print "$fastq->[0]\n$fastq->[1]\n$fastq->[2]\n$fastq->[3]\n";
            }
            $lineIdx = 0;
        }
    }
    

    To use, e.g.:

    $ ./filter_fastq_reads.pl < reads.fq > filtered_reads.fq
    

    This prints out reads in the order they are found. This is just filtering, which should be very fast. Otherwise, if you need to sort on some criteria, please specify the sort criteria in your question.

    In Python:

    #!/usr/bin/env python
    
    import sys
    
    line_idx = 0
    record = []
    for line in sys.stdin:
        record[line_idx] = line.rstrip()
        line_idx += 1
        if line_idx == 4:
            read_length = len(record[1])
            if read_length >= 15 and read_length <= 17:
                sys.stdout.write('{}\n'.format('\n'.join(record)))
            line_idx = 0