pythonregexbiopythonfastapython-3.12

How to iterate a list created with SeqIO using re.findall, FASTA-input


I have spent way too much time on this now (+10 hours).

Input is a file in fasta-format. Output should be a text-file containing the gene-ID and the matched patterns (three different patterns)

I wanted to make my own function to avoid writing the same code three times, but I've given up now and just written it three times (and works fine).

Is there a way to use this: records = list(SeqIO.parse('mytextfile.fasta', 'fasta')) instead of the code that I'm currently using three times (down below) or some other function? It's for a school assignment so it shouldn't be too complicated either but I have to use the Bio and re-module to solve it.

from Bio import SeqIO
import re

outfile = 'sekvenser.txt'

for seq_record in SeqIO.parse('prot_sequences.fasta', 'fasta'):
    match = re.findall(r'W.P', str(seq_record.seq), re.I)
    if match:       
        with open(outfile, 'a') as f:
            record_string = str(seq_record.id)
            newmatch = str(match)
            result = record_string+'\t'+newmatch
            print(result)
            f.write(result + '\n')

I've tried this

records = list(SeqIO.parse('prot_sequences.fasta', 'fasta'))
new_list = []
i = r'W.P'

for i in records:
    match = re.findall(i)
    if match:       
        new_list.append(match)

print(new_list)

But it only gives me that findall() is missing 1 required positional argument: 'string'.

As I can see it, i is a string (as I made the variable). Obviously I'm doing something wrong. If I try to insert seq_record that I'm using in my other code, it tells me that seq_record isn't defined. I don't understand what I'm supposed to put after the i in the code.


Solution

  • Input prot_sequences.fasta :

    >one
    WWWWWWCCCCCPPPPPP
    >two
    SSSSSSSSRRRRRRRRTTTTWWWWWWDDDDDDMMMMM
    >three
    QQQQQQQQQWAPTCCCCCCCWYPGGGGGGGGGGGGGG
    

    code :

    import Bio
    
    from Bio import SeqIO
    
    import re
    
    
    print('biopython version : ', Bio.__version__)
    
    records = list(SeqIO.parse('prot_sequences.fasta', 'fasta'))
    new_list = []
    i = r'W.P'
    
    #print(i)
    
    for rec in records:
        
        #print('rec : ', rec, rec.seq , type(rec.seq)
        match = re.findall(i, str(rec.seq) )
        if match:       
            new_list.append(match)
    
    print(new_list)
    

    output :

    biopython version :  your Biopython Version
    [['WAP', 'WYP']]
    

    if you uncomment : #print('rec : ', rec, rec.seq , type(rec.seq)

    you'll see that rec.seq it's a <class 'Bio.Seq.Seq'> so not suitable as argument to be feed on re.findall(pattern, string, flags=0)

    records = [str(rec.seq) for rec in SeqIO.parse('prot_sequences.fasta', 'fasta')]
    
    print( records, type(records))
    

    does give you a list of string but you will loose the rec.id needed for your assignment

    See Bio.SeqIO.parse(handle, format, alphabet=None) to get a grasp of the Object properties that:

    Turn a sequence file into an iterator returning SeqRecords.

    you could do:

    records = [[rec.id , re.findall(i, str(rec.seq) , re.I)] for rec in SeqIO.parse('prot_sequences.fasta', 'fasta') if len(re.findall(i, str(rec.seq) , re.I)) >= 1]
    
    print(records)
    

    to get :

    [['three', ['WAP', 'WYP']]]
    

    but I am not sure it's the best way to do that: is actually very difficult to read, I am not sure is fast because (not sure again but it uses re.findall twice see Does Python automatically optimize/cache function calls?); so to get it maybe fastest but even more ugly , use:

    records = [[rec.id , a] for rec in SeqIO.parse('prot_sequences.fasta', 'fasta') if len(a:= re.findall(i, str(rec.seq) , re.I) ) >= 1]
    
    print(records)
    

    out:

    [['three', ['WAP', 'WYP']]]
    

    as per the use of list() , the ones below gives same result:

    records = list([[rec.id , re.findall(i, str(rec.seq) , re.I)] for rec in SeqIO.parse('prot_sequences-2.fasta', 'fasta') if len(re.findall(i, str(rec.seq) , re.I)) >= 1])
    
    print(records)
    
    
    records = list([rec.id , a] for rec in SeqIO.parse('prot_sequences-2.fasta', 'fasta') if len(a:= re.findall(i, str(rec.seq) , re.I) ) >= 1)
    
    print(records)
    

    res:

    [['three', ['WAP', 'WYP']]]
    [['three', ['WAP', 'WYP']]]