pythonbiopythongenome

Loading a huge fastq file in biopython with FastqGeneralIterator


I have a 25GB fastq file that I want to load with biopython. Of course, I don't want to work with loading the entire file in memory, which would be impossible. Looking at the documentation, I can see that there is a FastqGeneralIterator method to handle this case. And, indeed, this goes super fast (0.2s):

from Bio.SeqIO.QualityIO import FastqGeneralIterator
from itertools import islice

fastq_gen = FastqGeneralIterator(open('sequence.fastq'))
seq_load = list(islice(fastq_gen, 100))

Unfortunately, I don't see, from the docs, how I can load 100 sequences into SeqRecord objects, like this would do (for the entire file):

records = [rec for rec in SeqIO.parse('sequence.fastq', format='fastq')]

Now, there's also this method described here:

input_seq_iterator = SeqIO.parse("sequence.fastq", "fastq")
short_seq_iterator = (record for record in input_seq_iterator if len(record.seq) < 100)
records = [rec for rec in short_seq_iterator]

But, even for 100 records, it takes ages. I have interrupted it after several minutes.

So, how can I get SeqRecord objects from the FastqGeneralIterator, or from any other method with similarly fast execution?


Solution

  • not sure got your question right, here my attempt:

    sequence.fastq file:

    @ee15a423-b008-44be-a4b2-5441d11b0b94 runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5087 ch=53 start_time=2019-10-18T22:14:05Z
    GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGACAAAGAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCATTATGCATGAACGTAATGCCCATTAGTTGTGAATCCACCATGCGCGGAAGATAGAGCGACAGGCAAGTCACAAAGACACCGACAACTGTC
    +
    ##$&$&/035881()'$0&*('-.=;685()$.%($'%%&#&)+..0,&+&%.-/+,%&()$3:0&@09BF=>CC8(78029F7=<=)+@+.6CCFFC@-8%2579<B8;88412134,,;:8./,#1#&(%((09;B=??48<=<@79*-:B540,8=B=444:<571-B5=ED2.56;110.5+,*)%%*
    @105220b1-f48a-43b4-8e89-a3cf20afeb0d runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5559 ch=33 start_time=2019-10-18T22:15:11Z
    CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTACCGTGACAAGAAAGTTGTCCGGTATCTTTGTGTTTCTGTTGGTGCTGATATTGCGTTATGCATGAACGTAATGCTCATCGGATTGTGAATCCACCATGCGCGGAAAGGCAGGGCGACAGGCAAGTCACAAGGAACACCAGACGCTTTGT
    +
    ,&'%/..0013618'.'(*'#(%#$&&%%,-//&#'$&&'&%$('''4+9;7<87756-1-+$$,3%%;)-%$%%$&)''#067897:9$%)(<AC?,(**$&'.6:<=394./41*,12((:?;3/9'(-4<)=99D99BFAC9:;588+.&+&%()%-(59,,($,12+,,*6(-))&-$&0'%)%&%'""%(0((
    @ccdb7b45-8b29-4105-b9ea-f434d0ffc14a runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5389 ch=124 start_time=2019-10-18T22:15:41Z
    GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTGACAAGAAAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCGTTATATGAACGTAATGCTCATCGGATTGTGAATCCACCATGCGCGGAAGAAGAGCGACAGGCGGGTACAAAGACGCGGCAACTCTTATGGCGTTAACCGGATCATCGCCGTGACAAAGAAAGATTGTCGGTGTCTTTGTGACTTGCCTGTCGCTCTGCTTCCGCGCATGGTGGATTCACAATCCGATGAGCTACGTTCATGCATAACGCAATATCAGCACCAACAGAAACATAGACAACCGGCGACAGCTTTCTTGTC
    +
    )02$3+../*)..2,-&)#$$()70343%'$$$&&%$$),(('*%%').234*,,33%))-2,&9,1<B9:JIH;G>A?3974@A48D119C<5539.-$*&%%%)&&43()*+&&3:493<+1+48.844:66>ABD;>?C;>6:56<<&*&,2.+*%$(1$'%'$),-)/+-%(#$*##&(+*'$$#$#%)*'(&*(,$')'((%%&%0-+.;<))7+%/)7>88>3:B>8;:<@5:333=<*::>=;48726%$%&)6:A>A?75&'(<?;=-/&-07--=:51((,005CCF8:6320$(%%%%(&'00*40.1/16+*/((/3.2($&%,4'''($$33;,5'''1J@@B5''
    
    
    

    code :

    from Bio.SeqIO.QualityIO import FastqGeneralIterator
    
    from itertools import islice
    
    from Bio.SeqRecord import SeqRecord
    
    from Bio.Seq import Seq
    
    sequenz = 'sequence.fastq'
    
    fastq_gen = FastqGeneralIterator(open(sequenz))
    
    seq_load = list(islice(fastq_gen, 3))
    
    records = [SeqRecord(Seq(i[1]), i[0], '', '') for i in seq_load]
    
    print(records, len(records))
    
    print('################')
    
    for i in records:
        print(i,'\n___________')
    
    for i in records:
        
        print(type(i))
        print(i.id)
        print(i.seq,'\n+++++++++++++++')
    

    output:

    [SeqRecord(seq=Seq('GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGA...GTC'), id='ee15a423-b008-44be-a4b2-5441d11b0b94 runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5087 ch=53 start_time=2019-10-18T22:14:05Z', name='', description='', dbxrefs=[]), SeqRecord(seq=Seq('CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTACCGTG...TGT'), id='105220b1-f48a-43b4-8e89-a3cf20afeb0d runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5559 ch=33 start_time=2019-10-18T22:15:11Z', name='', description='', dbxrefs=[]), SeqRecord(seq=Seq('GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTGACAA...GTC'), id='ccdb7b45-8b29-4105-b9ea-f434d0ffc14a runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5389 ch=124 start_time=2019-10-18T22:15:41Z', name='', description='', dbxrefs=[])] 3
    ################
    ID: ee15a423-b008-44be-a4b2-5441d11b0b94 runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5087 ch=53 start_time=2019-10-18T22:14:05Z
    Number of features: 0
    Seq('GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGA...GTC') 
    ___________
    ID: 105220b1-f48a-43b4-8e89-a3cf20afeb0d runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5559 ch=33 start_time=2019-10-18T22:15:11Z
    Number of features: 0
    Seq('CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTACCGTG...TGT') 
    ___________
    ID: ccdb7b45-8b29-4105-b9ea-f434d0ffc14a runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5389 ch=124 start_time=2019-10-18T22:15:41Z
    Number of features: 0
    Seq('GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTGACAA...GTC') 
    ___________
    <class 'Bio.SeqRecord.SeqRecord'>
    ee15a423-b008-44be-a4b2-5441d11b0b94 runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5087 ch=53 start_time=2019-10-18T22:14:05Z
    GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGACAAAGAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCATTATGCATGAACGTAATGCCCATTAGTTGTGAATCCACCATGCGCGGAAGATAGAGCGACAGGCAAGTCACAAAGACACCGACAACTGTC 
    +++++++++++++++
    <class 'Bio.SeqRecord.SeqRecord'>
    105220b1-f48a-43b4-8e89-a3cf20afeb0d runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5559 ch=33 start_time=2019-10-18T22:15:11Z
    CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTACCGTGACAAGAAAGTTGTCCGGTATCTTTGTGTTTCTGTTGGTGCTGATATTGCGTTATGCATGAACGTAATGCTCATCGGATTGTGAATCCACCATGCGCGGAAAGGCAGGGCGACAGGCAAGTCACAAGGAACACCAGACGCTTTGT 
    +++++++++++++++
    <class 'Bio.SeqRecord.SeqRecord'>
    ccdb7b45-8b29-4105-b9ea-f434d0ffc14a runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5389 ch=124 start_time=2019-10-18T22:15:41Z
    GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTGACAAGAAAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCGTTATATGAACGTAATGCTCATCGGATTGTGAATCCACCATGCGCGGAAGAAGAGCGACAGGCGGGTACAAAGACGCGGCAACTCTTATGGCGTTAACCGGATCATCGCCGTGACAAAGAAAGATTGTCGGTGTCTTTGTGACTTGCCTGTCGCTCTGCTTCCGCGCATGGTGGATTCACAATCCGATGAGCTACGTTCATGCATAACGCAATATCAGCACCAACAGAAACATAGACAACCGGCGACAGCTTTCTTGTC 
    +++++++++++++++
    

    as per https://biopython.org/docs/1.75/api/Bio.SeqIO.QualityIO.html#Bio.SeqIO.QualityIO.FastqGeneralIterator

    Bio.SeqIO.QualityIO.FastqGeneralIterator(handle)

    Iterate over Fastq records as string tuples (not as SeqRecord objects).

    This code does not try to interpret the quality string numerically. It just returns tuples of the title, sequence and quality as strings. For the sequence and quality, any whitespace (such as new lines) is removed.

    same question here (12 years ago):https://www.biostars.org/p/966/