
Change ID in multiple FASTA files


I need to rename multiple sequences in multiple FASTA files. I found this script, which does it for a single ID:


from Bio import SeqIO

original_file = "./original.fasta"
corrected_file = "./corrected.fasta"

with open(original_file) as original, open(corrected_file, 'w') as corrected:
    # Parse from the already-open handle and write each record back out
    for record in SeqIO.parse(original, 'fasta'):
        print(record.id)
        if record.id == 'foo':
            record.id = 'bar'
            # Also reset the description, otherwise the old ID stays in the header
            record.description = 'bar'
        print(record.id)
        SeqIO.write(record, corrected, 'fasta')

Each FASTA file corresponds to a single organism, but the organism is not specified in the IDs. I also have the original FASTA files (my files are translations of them), with the same filenames but in a different directory, and their IDs include the name of each organism. I would like to figure out how to loop through all these FASTA files and rename each ID in each file with the corresponding organism name.


Solution

  • Here is my attempt. I had to use my own input folders and files, since they were not specified in the question.

    /old folder contains files :

    MW628877.1.fasta :

    >MW628877.1 Streptococcus agalactiae strain RYG82 DNA gyrase subunit A (gyrA) gene, complete cds
    ATGCAAGATAAAAATTTAGTAGATGTTAATCTAACTAGTGAAATGAAAACGAGTTTTATCGATTACGCCA
    TGAGTGTCATTGTTGCTCGTGCACTTCCAGATGTTAGAGATGGTTTAAAACCTGTTCATCGTCGTATTTT
    >KY347969.1 Neisseria gonorrhoeae strain 1448 DNA gyrase subunit A (gyrA) gene, partial cds
    CGGCGCGTACCGTACGCGATGCACGAGCTGAAAAATAACTGGAATGCCGCCTACAAAAAATCGGCGCGCA
    TCGTCGGCGACGTCATCGGTAAATACCACCCCCACGGCGATTTCGCAGTTTACGGCACCATCGTCCGTAT
    
    

    MG995190.1.fasta :

    >MG995190.1 Mycobacterium tuberculosis strain UKR100 GyrA (gyrA) gene, complete cds
    ATGACAGACACGACGTTGCCGCCTGACGACTCGCTCGACCGGATCGAACCGGTTGACATCCAGCAGGAGA
    TGCAGCGCAGCTACATCGACTATGCGATGAGCGTGATCGTCGGCCGCGCGCTGCCGGAGGTGCGCGACGG
    

    and an /empty folder.

    /new folder contains files :

    MW628877.1.fasta :

    >MW628877.1
    MQDKNLVDVNLTSEMKTSFIDYAMSVIVARALPDVRDGLKPVHRRI
    >KY347969.1
    RRVPYAMHELKNNWNAAYKKSARIVGDVIGKYHPHGDFAVYGTIVR
    

    MG995190.1.fasta :

    >MG995190.1
    MTDTTLPPDDSLDRIEPVDIQQEMQRSYIDYAMSVIVGRALPEVRD
    

    my code is :

    from os import scandir

    from Bio import SeqIO

    old = './old'
    new = './new'

    # Map each sequence ID in the old files to its organism name
    # (words 2 and 3 of the header, e.g. 'Streptococcus agalactiae')
    old_ids_dict = {}
    for entry in scandir(old):
        if entry.is_file():
            print(entry)
            for seq_record in SeqIO.parse(entry.path, "fasta"):
                old_ids_dict[seq_record.id] = ' '.join(seq_record.description.split(' ')[1:3])

    print('_____________________')
    print('old ids ---> ', old_ids_dict)
    print('_____________________')

    # Append the organism name to every matching ID in the new files
    for entry in scandir(new):
        if entry.is_file():
            sequences = []
            for seq_record in SeqIO.parse(entry.path, "fasta"):
                if seq_record.id in old_ids_dict:
                    print('@@@ ', seq_record.id, '    ', old_ids_dict[seq_record.id])
                    seq_record.id += '.' + old_ids_dict[seq_record.id]
                    # Clear the description so the header holds only the new ID
                    seq_record.description = ''
                    print('-->', seq_record.id)
                print(seq_record)
                sequences.append(seq_record)
            # Overwrites the file in ./new in place
            SeqIO.write(sequences, entry.path, 'fasta')
    
    

    Check how it works; note that it overwrites both files in the new folder in place.

    As @Vovin pointed out in his comment, it needs to be adapted to your own files and to the ID format you are converting from and to.

    I am sure there is more than one way to do this, probably better and more Pythonic than mine; I am still learning too. Let us know.
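
If overwriting the files in place is a concern, the same idea can be sketched so that the renamed files go to a separate folder. This is only a rough, plain-text variant that reads the header lines directly instead of going through SeqIO. It assumes every header in the old files looks like `>ID Genus species ...`. The function names `organism_by_id` and `rename_headers` are made up for this illustration, and joining the organism name with underscores is my own choice, so that the full name stays inside the ID if the file is parsed again.

```python
from pathlib import Path


def organism_by_id(old_dir):
    """Map each sequence ID found in old_dir/*.fasta to 'Genus species'."""
    mapping = {}
    for path in sorted(Path(old_dir).glob("*.fasta")):
        for line in path.read_text().splitlines():
            if line.startswith(">"):
                fields = line[1:].split()
                # Assumption: the header is '>ID Genus species ...'
                if len(fields) >= 3:
                    mapping[fields[0]] = " ".join(fields[1:3])
    return mapping


def rename_headers(new_dir, out_dir, mapping):
    """Copy new_dir/*.fasta into out_dir, appending the organism to known IDs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(new_dir).glob("*.fasta")):
        lines = []
        for line in path.read_text().splitlines():
            fields = line[1:].split() if line.startswith(">") else []
            if fields and fields[0] in mapping:
                # Underscores keep the whole organism name inside the ID field;
                # any description after the ID is dropped, as in the code above
                organism = mapping[fields[0]].replace(" ", "_")
                line = ">" + fields[0] + "." + organism
            lines.append(line)
        # The input file in new_dir is left untouched
        (out_dir / path.name).write_text("\n".join(lines) + "\n")
```

It could be called as `rename_headers("./new", "./renamed", organism_by_id("./old"))`, where `./renamed` is a hypothetical output folder. Headers whose ID is not found in the old files are copied unchanged.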