bioinformatics fasta dna-sequence sequence-alignment

How to order multiple FASTA alignment files


I'm sure this is easy to do, but I have very limited bioinformatics experience.

I have many (about 100,000) FASTA files that contain alignments of different genes from the same 12 species. Each file looks something like this:

>dmel
ACTTTTGATACAATTAAC
>dsim
AATCCCAGACAAATTAAG
>dsec
AGTTTTGCAATGGTAAAT
>dere
TGGAATATTAGACGAATT 
...

Not all of the files are ordered in the same way, and I would like them all to be. They could be sorted alphabetically if that is easier; the order itself doesn't matter as long as every file uses the same one. Sorted alphabetically, the file above would look like this:

>dere
TGGAATATTAGACGAATT
>dmel
ACTTTTGATACAATTAAC
>dsec
AGTTTTGCAATGGTAAAT
>dsim
AATCCCAGACAAATTAAG
...

Any script that does this automatically would be much appreciated.

Edit: I have been using a shell script based on sed that works but is problematic. It works when the number of files is not too large, but in this particular case it creates duplicated files with different names. The script reads:

#!/bin/bash
echo
for i in {0..114172}; do
#sed -e '$ d' bloque.fasta.trim$i >b0.fasta.trim
#sed -e 's/ /ñ/g' <b0.fasta.trim >b1.fasta.trim
sed -e 's/ /ñ/g' <bloque.fasta.trim$i >b1.fasta.trim
tr "\n" " " <b1.fasta.trim >b2.fasta.trim
sed -e 's/ //g' < b2.fasta.trim >b3.fasta.trim
sed -e 's/>/\n>/g' < b3.fasta.trim >b4.fasta.trim
sed '1d' b4.fasta.trim >b5.fasta.trim
sort b5.fasta.trim >b6.fasta.trim 
sed -e 's/ñ/\n/g' < b6.fasta.trim >b7.fasta.trim$i
done

The unordered files are called bloque.fasta.trim$i, and the script should create one b7.fasta.trim$i file for each of them. The problem is that it sometimes duplicates a file under a different name. I am sure there must be an easier approach that doesn't make duplication mistakes.


Solution

  • Any script that does this automatically would be much appreciated.

    I don't know if this is exactly what you want, but it's easy to sort FASTA files using Biopython.

    First, install the module:

    # On Debian/Ubuntu (newer releases name the package python3-biopython)
    sudo apt-get install python-biopython
    
    # On any other operating system, install pip and then run
    pip install biopython
    

    Now save this code in a file, e.g. fasta_sorter.py:

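    from Bio import SeqIO
    import sys
    
    # Path of the FASTA file to sort, given as the first command-line argument
    infile = sys.argv[1]
    
    # Index the records by sequence ID (raises ValueError on duplicate IDs)
    with open(infile) as handle:
        records_dict = SeqIO.to_dict(SeqIO.parse(handle, 'fasta'))
    
    # Print the records in alphabetical order of their IDs;
    # this print call works under both Python 2 and Python 3
    for rec in sorted(records_dict):
        print(">%s\n%s" % (rec, records_dict[rec].seq))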
    

    After that, you can sort each of your files with:

    python fasta_sorter.py /path/to/your.fasta > file.sorted.fasta
    

    You can run this command inside your for loop, once per input file.
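
If you would rather avoid both the shell loop and the Biopython dependency, the whole batch can be sketched in plain Python. This is only a minimal sketch under some assumptions: the function names `sort_fasta_text` and `sort_all` and the `.sorted` output suffix are made up for illustration, the whole header line (not just the first word) is used as the sort key, and multi-line sequences are joined onto one line. It writes each output directly from its own input without shared temporary files. One plausible (unconfirmed) cause of the duplicates in the sed script is that the b1...b6 temporary files are reused when the input for some index is missing, and this approach sidesteps that.

```python
from pathlib import Path


def sort_fasta_text(text):
    """Return FASTA text with its records sorted alphabetically by header."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            if header in records:
                raise ValueError("duplicate header: " + header)
            records[header] = []
        elif header is not None:
            # sequence line belonging to the most recent header
            records[header].append(line)
    return "".join(
        ">%s\n%s\n" % (name, "".join(records[name])) for name in sorted(records)
    )


def sort_all(in_dir, out_dir, pattern="bloque.fasta.trim*"):
    """Write a sorted copy of every matching file in in_dir to out_dir."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    # Only files that actually exist are visited, so gaps in the
    # numbering cannot produce stale or duplicated output.
    for src in sorted(Path(in_dir).glob(pattern)):
        if not src.is_file():
            continue
        dst = out_path / (src.name + ".sorted")
        dst.write_text(sort_fasta_text(src.read_text()))
        written.append(dst)
    return written
```

Calling `sort_all(".", "sorted")` from the directory that holds the bloque.fasta.trim files would then write one `.sorted` file per input into a `sorted` subdirectory. Keeping the output in a separate directory means a second run does not pick up the already sorted files.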