bash awk sequence rename spp

Replace the header lines of several sequences in a FASTA file with the species names stored in a list (.txt)


I have a FASTA file with several sequences, but the header line of every sequence is the same string (ABI). I want to replace each header with the species name stored in a separate text file.

My FASTA file looks like this:

>ABI
AGCTAGTCCCGGGTTTATCGGCTATAC
>ABI
ACCCCTTGACTGACATGGTACGATGAC
>ABI
ATTTCGACTGGTGTCGATAGGCAGCAT
>ABI
ACGTGGCTGACATGTATGTAGCGATGA

The species list (spp) looks like this:

Alsophila cuspidata
Bunchosia argentea
Miconia cf.gracilis
Meliosma frondosa

How can I replace the ABI headers of my sequences with the names of my species, in that exact order?

Required output:

>Alsophila cuspidata
AGCTAGTCCCGGGTTTATCGGCTATAC
>Bunchosia argentea
ACCCCTTGACTGACATGGTACGATGAC
>Miconia cf.gracilis
ATTTCGACTGGTGTCGATAGGCAGCAT
>Meliosma frondosa
ACGTGGCTGACATGTATGTAGCGATGA

I was using something like:

awk '
FNR==NR{
  a[$1]=$2
  next
}
($2 in a) && /^>/{
  print ">"a[$2]
  next
}
1
' spp_list.txt FS="[> ]"  all_spp.fasta

This is not working. Could someone guide me, please?


Solution

  • Hello, not a dev so don't be rude.

    Hope this will help you:

    I created a file fasta.txt that contains:

    >ABI
    AGCTAGTCCCGGGTTTATCGGCTATAC
    >ABI
    ACCCCTTGACTGACATGGTACGATGAC
    >ABI
    ATTTCGACTGGTGTCGATAGGCAGCAT
    >ABI
    ACGTGGCTGACATGTATGTAGCGATGA
    

    I also created a file spplist.txt that contains:

    Alsophila cuspidata
    Bunchosia argentea
    Miconia cf.gracilis
    Meliosma frondosa
    

    I then created a Python script named fasta.py; here it is:

    #!/usr/bin/env python3

    # Replace each ">ABI" header in fasta.txt with the next species name
    # from spplist.txt (same order) and write the result to output.txt.

    with open("spplist.txt", "r") as spplist:
        # one species name per line; blank lines are skipped
        names = [line.strip() for line in spplist if line.strip()]

    with open("fasta.txt", "r") as fasta, open("output.txt", "w") as output_file:
        remaining = iter(names)
        for line in fasta:
            if line.startswith(">ABI"):
                # keep the leading ">" so the line is still a FASTA header
                print(">" + next(remaining), file=output_file)
            else:
                print(line.rstrip(), file=output_file)

    # Notify the user at the end of the work
    print("job done")
    

    (These three files need to be in the same directory for the script to work as it is.)

    Here is my directory tree:

    ❯ tree
    .
    ├── fasta.py
    ├── fasta.txt
    └── spplist.txt
    

    To execute the script, open a shell, cd into the directory, and type:

    ❯ python3 fasta.py
    job done
    

    You will see a new file named output.txt in the directory:

    ❯ tree
    .
    ├── fasta.py
    ├── fasta.txt
    ├── output.txt
    └── spplist.txt
    

    and here is its content:

    >Alsophila cuspidata
    AGCTAGTCCCGGGTTTATCGGCTATAC
    >Bunchosia argentea
    ACCCCTTGACTGACATGGTACGATGAC
    >Miconia cf.gracilis
    ATTTCGACTGGTGTCGATAGGCAGCAT
    >Meliosma frondosa
    ACGTGGCTGACATGTATGTAGCGATGA
    

    Hope this can help you out. bguess.
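
If the number of species names might not match the number of headers, a small check before renaming may save some confusion. The following is only a sketch of that idea: the function name `rename_headers` and its `old_header` parameter are made up for illustration, and it assumes every header to replace starts with `>ABI`, as in the sample data above.

```python
def rename_headers(fasta_lines, names, old_header=">ABI"):
    """Return the FASTA lines with each old_header line replaced by '>' + next name.

    Hypothetical helper: raises ValueError when the header and name counts differ.
    """
    lines = [line.rstrip("\n") for line in fasta_lines]
    names = [name.strip() for name in names if name.strip()]

    n_headers = sum(1 for line in lines if line.startswith(old_header))
    if n_headers != len(names):
        raise ValueError(f"{n_headers} headers but {len(names)} species names")

    remaining = iter(names)
    renamed = []
    for line in lines:
        if line.startswith(old_header):
            # consume the names in file order, keeping the FASTA ">" marker
            renamed.append(">" + next(remaining))
        else:
            renamed.append(line)
    return renamed


# Small in-memory example built from the first two records of the question
fasta = [">ABI", "AGCTAGTCCCGGGTTTATCGGCTATAC",
         ">ABI", "ACCCCTTGACTGACATGGTACGATGAC"]
species = ["Alsophila cuspidata", "Bunchosia argentea"]

renamed = rename_headers(fasta, species)
print("\n".join(renamed))
```

With real files, the same function could be called with open file handles, for example `rename_headers(open("fasta.txt"), open("spplist.txt"))`, since it only iterates over lines.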