pythonloopsoptimizationfasta

Optimizing for loop for using dictionary values to format new fasta files


I have a for loop that does what I need, however, Im wondering if the loop could be further optimized.

I have a dictionary, where each key has a list of values associated with it. The list of values are basically accession numbers. For each key, id like to loop through my fasta file, and create a new fasta file with only the sequences whos accession number are in the list of values for each given key.

The loop I have so far is as fallows...

seqs=open('file_name.fasta', 'r')
seq_lines=seqs.readlines()

for i in d.keys():
    fasts=open(i,'w')
    for index, line in enumerate(seq_lines):
        for j in d[i]: #loops thru the accession number (ie lis of values for each key) 
            if (j in line): #checks to see if acc number is in fasta file
                header=seq_lines[index] #the header from fasta file
                nuc=seq_lines[index+1] # The sequence after each header
                fasts.write(header.rstrip()+'\n'+nuc)

d is my dictionary and seq_lines is the fasta file. This loop works fine, however it is slow when there are lots of sequences for any given key (the fasta file has around 50000 seq in total). Appreciate any input.

Example of data

This is the dictionary with the list of values d={'range1': ['MZ193866', 'OK234148'], 'range2': ['MZ371332', 'MZ315819']}

Fasta file, with the values in the header (its a text file, not sure how to upload it directly)

seq='>MZ193866 \n ATGTTTTATTATTT \n >MZ371332 \n GGGGCTTCTATTTCCA \n > OK234148 \n AGGGCTTTTAAAAA \n >MZ315819 \n ATTTTCCCCGGGGAAAAA'

By the end Id like to have file names after the keys, containing only the sequences that are in its list of values ...

range1 = '>MZ193866 \n ATGTTTTATTATTT \n > OK234148 \n AGGGCTTTTAAAAA'

range2 = '>MZ371332 \n GGGGCTTCTATTTCCA \n >MZ315819 \n ATTTTCCCCGGGGAAAAA'


Solution

  • You can group by each header, extract the accession number and use that as an index to the fasta records. Now, instead of scanning all records for each number of interest, you can lookup in the index.

    Here, I've used a regular expression to parse the header - it's very specific to this one fasta format. A fasta with multiple values would be more complex.

    from itertools import groupby
    import re
    
    # single column fasta header with optional whitespace before >
    hdr_match = re.compile(r"\s*>").match
    hdr_extract = re.compile(r"\s*>\s*([^\s]+)").match
    
    # test data
    d={'range1': ['MZ193866', 'OK234148'], 'range2': ['MZ371332', 'MZ315819']}
    
    open("file_name_test.fasta", "w").write(""">MZ193866 
     ATGTTTTATTATTT 
     >MZ371332 
     GGGGCTTCTATTTCCA 
     > OK234148 
     AGGGCTTTTAAAAA 
     >MZ315819 
     ATTTTCCCCGGGGAAAAA
     """)
    
    # fasta header is a single (whatever its called)
    # e.g., " >MZ193866 \n" (note the non-conforming space starting some headers)
    
    fasta_index = {}
    
    for is_hdr, lines in groupby(open('file_name_test.fasta'), hdr_match):
        if is_hdr:
            hdr = hdr_extract(next(lines)).group(1)
        else:
            fasta_index[hdr] = "".join(line.strip() for line in lines)
    
    for name, values in d.items():
        with open(name, "w") as fasts:
            for value in values:
                if value in fasta_index:
                    fasts.write(f">{value}\n{fasta_index[value]}\n")
    
    print("------- output -------")
    for name in d:
        print(name)
        print(open(name).read())
        print("--------")