unix awk sed fasta

How to add strings to fasta identifiers


I have a fasta file with several sequences:

grep -e ">" seq.fasta
>mmu_miR_8109 
>mmu_miR_8110 
>mmu_miR_8111 
>mmu_miR_8112 
>mmu_miR_8113 
>mmu_miR_8114 
>LQNS02136402.1_14821_5p 
>LQNS02278094.1_35771_5p 
>Dpu-Mir-22-P2_LQNS02276481.1_18963_3p 

I want to append more text to each sequence identifier so that it looks like this:

grep -e ">" results.fasta
>mmu_miR_8109 MOUSE Mus musculus miR_8109
>mmu_miR_8110 MOUSE Mus musculus miR_8110
>mmu_miR_8111 MOUSE Mus musculus miR_8111
>mmu_miR_8112 MOUSE Mus musculus miR_8112
>mmu_miR_8113 MOUSE Mus musculus miR_8113
>mmu_miR_8114 MOUSE Mus musculus miR_8114
>LQNS02136402.1_14821_5p MOUSE Mus musculus 14821_5p
>LQNS02278094.1_35771_5p MOUSE Mus musculus 35771_5p
>Dpu-Mir-22-P2_LQNS02276481.1_18963_3p MOUSE Mus musculus 18963_3p

Note that "MOUSE Mus musculus" is always the same, and the last part of each new header is the last two underscore-separated parts of the original identifier (for example, miR_8109 or 14821_5p).

So far I have managed to do this:

 grep -e ">" seq.fasta | sed 's/>.*/& MOUSE/' | sed 's/>.*/& Mus musculus/' 

However, I am missing the last part (keeping the last values), and I don't know how to apply these changes to the fasta file itself. What can I try next?


Solution

  • Here is a simple way with awk. Setting the field separator (FS) to an underscore is convenient here: when a line is a header, we append the fixed string followed by the last two underscore-separated fields of the existing header. The trailing 1 prints every line, modified or not.

    awk -F_ '/^>/{$0 = $0 " MOUSE Mus musculus " $(NF-1) FS $NF} 1' file
    

    Output:

    >mmu_miR_8109 MOUSE Mus musculus miR_8109 
    >mmu_miR_8110 MOUSE Mus musculus miR_8110 
    >mmu_miR_8111 MOUSE Mus musculus miR_8111 
    >mmu_miR_8112 MOUSE Mus musculus miR_8112 
    >mmu_miR_8113 MOUSE Mus musculus miR_8113 
    >mmu_miR_8114 MOUSE Mus musculus miR_8114 
    >LQNS02136402.1_14821_5p MOUSE Mus musculus 14821_5p 
    >LQNS02278094.1_35771_5p MOUSE Mus musculus 35771_5p 
    >Dpu-Mir-22-P2_LQNS02276481.1_18963_3p MOUSE Mus musculus 18963_3p 
    

    After you have confirmed that the output is good, you can modify the existing file, like this:

    awk -F_ '/^>/{$0 = $0 " MOUSE Mus musculus " $(NF-1) FS $NF} 1' file > file.tmp && mv file.tmp file
    

    Always back up your data before proceeding.
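
The headers listed in the question appear to end with a space. If that space really is in the file (and not just an artifact of how the listing was pasted), it would end up inside the last field and in the appended text. The following is a minimal sketch of the same approach that trims trailing whitespace on header lines first and keeps a backup copy. The sample records, the sequences, and the names workdir, seq.fasta.bak and seq.fasta.tmp are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample data: two records with a trailing space on each header.
workdir=$(mktemp -d)
printf '%s\n' \
    '>mmu_miR_8109 ' \
    'ACGUACGU' \
    '>Dpu-Mir-22-P2_LQNS02276481.1_18963_3p ' \
    'UGCAUGCA' > "$workdir/seq.fasta"

# Keep a copy of the original before modifying anything.
cp "$workdir/seq.fasta" "$workdir/seq.fasta.bak"

# On header lines only: strip trailing spaces, tabs and carriage returns.
# sub() changes $0, so awk re-splits the fields and $NF no longer carries
# the trailing space. Then append the fixed string plus the last two
# underscore-separated fields. The final 1 prints every line.
awk -F_ '/^>/ { sub(/[ \t\r]+$/, ""); $0 = $0 " MOUSE Mus musculus " $(NF-1) FS $NF } 1' \
    "$workdir/seq.fasta" > "$workdir/seq.fasta.tmp" &&
    mv "$workdir/seq.fasta.tmp" "$workdir/seq.fasta"

cat "$workdir/seq.fasta"
```

With this sample, the first header becomes ">mmu_miR_8109 MOUSE Mus musculus miR_8109" with no trailing space, the sequence lines are left as they were, and the untouched original remains in seq.fasta.bak.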