I have a 25GB fastq file that I want to load with biopython. Of course, I don't want to work with loading the entire file in memory, which would be impossible. Looking at the documentation, I can see that there is a FastqGeneralIterator
method to handle this case. And, indeed, this goes super fast (0.2s):
from Bio.SeqIO.QualityIO import FastqGeneralIterator
from itertools import islice
fastq_gen = FastqGeneralIterator(open('sequence.fastq'))
seq_load = list(islice(fastq_gen, 100))
Unfortunately, I don't see, from the docs, how I can load 100 sequences into SeqRecord
objects, like this would do (for the entire file):
records = [rec for rec in SeqIO.parse('sequence.fastq', format='fastq')]
Now, there's also this method described here:
input_seq_iterator = SeqIO.parse("sequence.fastq", "fastq")
short_seq_iterator = (record for record in input_seq_iterator if len(record.seq) < 100)
records = [rec for rec in short_seq_iterator]
But, even for 100 records, it takes ages. I have interrupted it after several minutes.
So, how can I get SeqRecord
objects from the FastqGeneralIterator
, or from any other method with similarly fast execution?
not sure got your question right, here my attempt:
sequence.fastq
file:
@ee15a423-b008-44be-a4b2-5441d11b0b94 runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5087 ch=53 start_time=2019-10-18T22:14:05Z
GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGACAAAGAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCATTATGCATGAACGTAATGCCCATTAGTTGTGAATCCACCATGCGCGGAAGATAGAGCGACAGGCAAGTCACAAAGACACCGACAACTGTC
+
##$&$&/035881()'$0&*('-.=;685()$.%($'%%&#&)+..0,&+&%.-/+,%&()$3:0&@09BF=>CC8(78029F7=<=)+@+.6CCFFC@-8%2579<B8;88412134,,;:8./,#1#&(%((09;B=??48<=<@79*-:B540,8=B=444:<571-B5=ED2.56;110.5+,*)%%*
@105220b1-f48a-43b4-8e89-a3cf20afeb0d runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5559 ch=33 start_time=2019-10-18T22:15:11Z
CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTACCGTGACAAGAAAGTTGTCCGGTATCTTTGTGTTTCTGTTGGTGCTGATATTGCGTTATGCATGAACGTAATGCTCATCGGATTGTGAATCCACCATGCGCGGAAAGGCAGGGCGACAGGCAAGTCACAAGGAACACCAGACGCTTTGT
+
,&'%/..0013618'.'(*'#(%#$&&%%,-//&#'$&&'&%$('''4+9;7<87756-1-+$$,3%%;)-%$%%$&)''#067897:9$%)(<AC?,(**$&'.6:<=394./41*,12((:?;3/9'(-4<)=99D99BFAC9:;588+.&+&%()%-(59,,($,12+,,*6(-))&-$&0'%)%&%'""%(0((
@ccdb7b45-8b29-4105-b9ea-f434d0ffc14a runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5389 ch=124 start_time=2019-10-18T22:15:41Z
GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTGACAAGAAAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCGTTATATGAACGTAATGCTCATCGGATTGTGAATCCACCATGCGCGGAAGAAGAGCGACAGGCGGGTACAAAGACGCGGCAACTCTTATGGCGTTAACCGGATCATCGCCGTGACAAAGAAAGATTGTCGGTGTCTTTGTGACTTGCCTGTCGCTCTGCTTCCGCGCATGGTGGATTCACAATCCGATGAGCTACGTTCATGCATAACGCAATATCAGCACCAACAGAAACATAGACAACCGGCGACAGCTTTCTTGTC
+
)02$3+../*)..2,-&)#$$()70343%'$$$&&%$$),(('*%%').234*,,33%))-2,&9,1<B9:JIH;G>A?3974@A48D119C<5539.-$*&%%%)&&43()*+&&3:493<+1+48.844:66>ABD;>?C;>6:56<<&*&,2.+*%$(1$'%'$),-)/+-%(#$*##&(+*'$$#$#%)*'(&*(,$')'((%%&%0-+.;<))7+%/)7>88>3:B>8;:<@5:333=<*::>=;48726%$%&)6:A>A?75&'(<?;=-/&-07--=:51((,005CCF8:6320$(%%%%(&'00*40.1/16+*/((/3.2($&%,4'''($$33;,5'''1J@@B5''
code :
from Bio.SeqIO.QualityIO import FastqGeneralIterator
from itertools import islice
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
sequenz = 'sequence.fastq'
fastq_gen = FastqGeneralIterator(open(sequenz))
seq_load = list(islice(fastq_gen, 3))
records = [SeqRecord(Seq(i[1]), i[0], '', '') for i in seq_load]
print(records, len(records))
print('################')
for i in records:
print(i,'\n___________')
for i in records:
print(type(i))
print(i.id)
print(i.seq,'\n+++++++++++++++')
output:
[SeqRecord(seq=Seq('GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGA...GTC'), id='ee15a423-b008-44be-a4b2-5441d11b0b94 runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5087 ch=53 start_time=2019-10-18T22:14:05Z', name='', description='', dbxrefs=[]), SeqRecord(seq=Seq('CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTACCGTG...TGT'), id='105220b1-f48a-43b4-8e89-a3cf20afeb0d runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5559 ch=33 start_time=2019-10-18T22:15:11Z', name='', description='', dbxrefs=[]), SeqRecord(seq=Seq('GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTGACAA...GTC'), id='ccdb7b45-8b29-4105-b9ea-f434d0ffc14a runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5389 ch=124 start_time=2019-10-18T22:15:41Z', name='', description='', dbxrefs=[])] 3
################
ID: ee15a423-b008-44be-a4b2-5441d11b0b94 runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5087 ch=53 start_time=2019-10-18T22:14:05Z
Number of features: 0
Seq('GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGA...GTC')
___________
ID: 105220b1-f48a-43b4-8e89-a3cf20afeb0d runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5559 ch=33 start_time=2019-10-18T22:15:11Z
Number of features: 0
Seq('CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTACCGTG...TGT')
___________
ID: ccdb7b45-8b29-4105-b9ea-f434d0ffc14a runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5389 ch=124 start_time=2019-10-18T22:15:41Z
Number of features: 0
Seq('GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTGACAA...GTC')
___________
<class 'Bio.SeqRecord.SeqRecord'>
ee15a423-b008-44be-a4b2-5441d11b0b94 runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5087 ch=53 start_time=2019-10-18T22:14:05Z
GTTGTACTTCGTTCAATCGGTAGGTGTTTAACCGGATGGTCACGCCTACCGTGACAAAGAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCATTATGCATGAACGTAATGCCCATTAGTTGTGAATCCACCATGCGCGGAAGATAGAGCGACAGGCAAGTCACAAAGACACCGACAACTGTC
+++++++++++++++
<class 'Bio.SeqRecord.SeqRecord'>
105220b1-f48a-43b4-8e89-a3cf20afeb0d runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5559 ch=33 start_time=2019-10-18T22:15:11Z
CAAGCATACTTCATTCAGTCAGGCGAAATTATTGCCAGGTCGCCGCCTACCGTGACAAGAAAGTTGTCCGGTATCTTTGTGTTTCTGTTGGTGCTGATATTGCGTTATGCATGAACGTAATGCTCATCGGATTGTGAATCCACCATGCGCGGAAAGGCAGGGCGACAGGCAAGTCACAAGGAACACCAGACGCTTTGT
+++++++++++++++
<class 'Bio.SeqRecord.SeqRecord'>
ccdb7b45-8b29-4105-b9ea-f434d0ffc14a runid=fa1d76e661ac2bbb53a002e85e75a30e91827c51 sampleid=1 read=5389 ch=124 start_time=2019-10-18T22:15:41Z
GTTGTACTTCGTTCAGTCGGTGGTGTTTAACTGGGTCATCGCCTACCGTGACAAGAAAGATTGTCGGTGTCTTTGTGTTTCTGTTGGTGCTGATATTGCGTTATATGAACGTAATGCTCATCGGATTGTGAATCCACCATGCGCGGAAGAAGAGCGACAGGCGGGTACAAAGACGCGGCAACTCTTATGGCGTTAACCGGATCATCGCCGTGACAAAGAAAGATTGTCGGTGTCTTTGTGACTTGCCTGTCGCTCTGCTTCCGCGCATGGTGGATTCACAATCCGATGAGCTACGTTCATGCATAACGCAATATCAGCACCAACAGAAACATAGACAACCGGCGACAGCTTTCTTGTC
+++++++++++++++
Bio.SeqIO.QualityIO.FastqGeneralIterator(handle)
Iterate over Fastq records as string tuples (not as SeqRecord objects).
This code does not try to interpret the quality string numerically. It just returns tuples of the title, sequence and quality as strings. For the sequence and quality, any whitespace (such as new lines) is removed.
same question here (12 years ago):https://www.biostars.org/p/966/