[SOLVED] Change ID in multiple FASTA files

I need to rename multiple sequences in multiple FASTA files. I found this script, which does it for a single ID:


from Bio import SeqIO

original_file = "./original.fasta"
corrected_file = "./corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    records = SeqIO.parse(original, 'fasta')  # parse the handle that is already open
    for record in records:
        print(record.id)
        if record.id == 'foo':
            record.id = 'bar'
            record.description = 'bar'  # otherwise the old description stays in the header
        print(record.id)
        SeqIO.write(record, corrected, 'fasta')

Each FASTA file corresponds to a single organism, but the organism is not specified in the IDs. I also have the original FASTA files (the ones I am working on are their translations), which have the same filenames but sit in different directories, and which include the name of each organism in their IDs. How can I loop through all these FASTA files and rename each ID in each file with the corresponding organism name?

Solution

OK, here is my effort. I had to use my own input folders/files since they were not specified in the question.

/old folder contains files :

MW628877.1.fasta :

>MW628877.1 Streptococcus agalactiae strain RYG82 DNA gyrase subunit A (gyrA) gene, complete cds
ATGCAAGATAAAAATTTAGTAGATGTTAATCTAACTAGTGAAATGAAAACGAGTTTTATCGATTACGCCA
TGAGTGTCATTGTTGCTCGTGCACTTCCAGATGTTAGAGATGGTTTAAAACCTGTTCATCGTCGTATTTT
>KY347969.1 Neisseria gonorrhoeae strain 1448 DNA gyrase subunit A (gyrA) gene, partial cds
CGGCGCGTACCGTACGCGATGCACGAGCTGAAAAATAACTGGAATGCCGCCTACAAAAAATCGGCGCGCA
TCGTCGGCGACGTCATCGGTAAATACCACCCCCACGGCGATTTCGCAGTTTACGGCACCATCGTCCGTAT

MG995190.1.fasta :

>MG995190.1 Mycobacterium tuberculosis strain UKR100 GyrA (gyrA) gene, complete cds
ATGACAGACACGACGTTGCCGCCTGACGACTCGCTCGACCGGATCGAACCGGTTGACATCCAGCAGGAGA
TGCAGCGCAGCTACATCGACTATGCGATGAGCGTGATCGTCGGCCGCGCGCTGCCGGAGGTGCGCGACGG

and an /empty folder.

/new folder contains files :

MW628877.1.fasta :

>MW628877.1
MQDKNLVDVNLTSEMKTSFIDYAMSVIVARALPDVRDGLKPVHRRI
>KY347969.1
RRVPYAMHELKNNWNAAYKKSARIVGDVIGKYHPHGDFAVYGTIVR

MG995190.1.fasta :

>MG995190.1
MTDTTLPPDDSLDRIEPVDIQQEMQRSYIDYAMSVIVGRALPEVRD

my code is :

from Bio import SeqIO
from os import scandir

old = './old'
new = './new'

# Map each ID in the /old files to the two words after it (genus and species).
old_ids_dict = {}
for filename in scandir(old):
    if filename.is_file():  # skip subfolders
        print(filename)
        for seq_record in SeqIO.parse(filename, "fasta"):
            old_ids_dict[seq_record.id] = ' '.join(seq_record.description.split(' ')[1:3])

print('_____________________')
print('old ids ---> ', old_ids_dict)
print('_____________________')

# Append the organism name to every matching ID in the /new files.
for filename in scandir(new):
    if filename.is_file():
        sequences = []
        for seq_record in SeqIO.parse(filename, "fasta"):
            if seq_record.id in old_ids_dict:
                print('@@@ ', seq_record.id, '    ', old_ids_dict[seq_record.id])
                seq_record.id += '.' + old_ids_dict[seq_record.id]
                seq_record.description = ''
                print('-->', seq_record.id)
            print(seq_record)
            sequences.append(seq_record)
        # Overwrites the input file in /new with the renamed records.
        SeqIO.write(sequences, filename, 'fasta')

Check how it works; note that it overwrites both files in the /new folder.

As @Vovin pointed out in his comment, it needs to be adapted to your own files: how the IDs look now and what you want them to become.
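
One possible adaptation, sketched here under assumptions, follows the layout described in the question: one organism per file and identical filenames in both folders. It pairs files by name and reads the organism from the first header of the original file. It uses plain file handling rather than SeqIO, the function names are hypothetical, and taking the second and third words of the header as the organism simply mirrors the slicing in the code above. Note that the sample MW628877.1.fasta above holds two organisms, so it does not satisfy the one-organism-per-file assumption.

```python
from pathlib import Path


def organism_from_header(header):
    """Return 'Genus species' from a header like '>ID Genus species strain ...'.

    Assumption: the organism is always the 2nd and 3rd word of the header.
    """
    words = header.lstrip(">").split()
    return " ".join(words[1:3])


def organism_per_file(old_dir, new_dir):
    """Map each filename present in both folders to the organism
    named in the first header of the file in old_dir."""
    result = {}
    for new_path in sorted(Path(new_dir).iterdir()):
        old_path = Path(old_dir) / new_path.name
        # Only keep names that exist as regular files in both folders.
        if not (new_path.is_file() and old_path.is_file()):
            continue
        with open(old_path) as handle:
            first_header = next(
                (line for line in handle if line.startswith(">")), ""
            )
        result[new_path.name] = organism_from_header(first_header)
    return result
```

The resulting dictionary is keyed by filename, so in the second loop of the code above it could stand in for old_ids_dict, looked up with filename.name instead of seq_record.id.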

I am sure there is more than one way to do this, probably better and more Pythonic than mine; I am learning too. Let us know.
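
One variation, for instance, avoids the in-place overwrite noted above by writing the renamed files to a separate folder. Since only the header lines change, this minimal sketch uses plain file handling instead of SeqIO. The function names, the separate output folder, and the underscore-joined ID format (which keeps whitespace out of the new ID) are all assumptions, not part of the code above.

```python
from pathlib import Path


def build_name_map(old_dir):
    """Map each sequence ID in old_dir to the two words after it,
    joined with an underscore (e.g. 'Genus_species')."""
    names = {}
    for path in sorted(Path(old_dir).iterdir()):
        if not path.is_file():  # skip subfolders
            continue
        with open(path) as handle:
            for line in handle:
                if line.startswith(">"):
                    fields = line[1:].split()
                    if fields:
                        names[fields[0]] = "_".join(fields[1:3])
    return names


def rename_headers(new_dir, out_dir, names):
    """Copy every file from new_dir to out_dir, appending the organism
    name to each known ID. Files in new_dir are left untouched."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(new_dir).iterdir()):
        if not path.is_file():
            continue
        with open(path) as src, open(out_dir / path.name, "w") as dst:
            for line in src:
                if line.startswith(">"):
                    fields = line[1:].split()
                    seq_id = fields[0] if fields else ""
                    # Unknown IDs (or IDs without an organism) pass through unchanged.
                    if names.get(seq_id):
                        line = f">{seq_id}_{names[seq_id]}\n"
                dst.write(line)
```

With the folders used above, a call might look like rename_headers('./new', './renamed', build_name_map('./old')), where ./renamed is a hypothetical output folder.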