Change a fasta header with the next word after 'similar to'

(cross-posted to Biostars: https://www.biostars.org/p/9562110/)

I have an annotated FASTA file, and I want to insert the word that follows 'Similar to' at the start of each header, right after the >.

>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG

I want the output to be like this

>Chid1_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG

How can I do it? I already tried with

sed -E 's/(Similar to )(\w+)/>CHIA_\2\1\2/' file.txt > new_file_2.txt

I stored the result in a new file and tried to paste it into the headers, but it does not work. Any ideas?

I also tried a Python script:

def extract_similar_to_word(line):
    words = line.split()
    for i, word in enumerate(words):
        if word == "Similar":
            similar_to_word = words[i + 2].strip('""')
            if i + 3 < len(words) and words[i + 3].strip('""')[0].isupper():
                similar_to_word = words[i + 1].strip('""') + words[i + 2].strip('""')
            return similar_to_word
    return None

def modify_fasta_headers(input_file, output_file):
    with open(input_file, "r") as in_file, open(output_file, "w") as out_file:
        for line in in_file:
            if line.startswith(">"):
                similar_to_word = extract_similar_to_word(line)
                if similar_to_word:
                    # Find the first space in the line, then insert the similar_to_word
                    first_space_index = line.find(" ")
                    line = ">" + similar_to_word + "_" + line[1:first_space_index] + line[first_space_index:]
            out_file.write(line)



input_file = "all_chias.fasta"
output_file = "modified_output_fasta_v1.fasta"

modify_fasta_headers(input_file, output_file)

Solution

In Python, you can use Biopython's SeqIO to rename the headers:

from Bio import SeqIO

def extract_similar_to(line):
    data = line.split('Similar to ')
    if len(data) > 1:
        return data[1].split(' ')[0] + "_" + line
    else:
        return line

def modify_fasta_headers(input_file, output_file):
    with open(output_file, "w") as outputs:
        for r in SeqIO.parse(input_file, "fasta"):
            # Rewrite description with similar to word
            r.description = extract_similar_to(r.description)
            # Remove old id
            r.id = ''
            SeqIO.write(r, outputs, "fasta")


input_file = "all_chias.fasta"
output_file = "modified_output_fasta_v1.fasta"

modify_fasta_headers(input_file, output_file)

The extract_similar_to() function was rewritten so that it splits the line on the keyword Similar to. For the example that you gave:

>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG

The line data = line.split('Similar to ') would return the following:

['_Anouracaudifer_00017283-RA transcript Name:"', 'Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393']

Since Similar to may not be present in the name, we check that the returned list has more than one element. If so, we return the first word of the second list element, followed by an underscore and the original line. Otherwise, we return the original line unchanged.
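Both branches can be checked in isolation. The snippet below is a small sketch that copies extract_similar_to() from above and runs it on two descriptions: the one from the question (cut off after the offset field for brevity) and a made-up header with no Similar to annotation, which is purely hypothetical.

```python
def extract_similar_to(line):
    # Same logic as above: split on the keyword and take the next word.
    data = line.split('Similar to ')
    if len(data) > 1:
        return data[1].split(' ')[0] + "_" + line
    else:
        return line

# Description from the question, shortened after the offset field.
annotated = ('_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 '
             'Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0')
# Hypothetical record without a "Similar to" annotation.
unannotated = 'seq_0001 transcript Name:"Protein of unknown function" offset:0'

with_gene = extract_similar_to(annotated)
without_gene = extract_similar_to(unannotated)

print(with_gene.split(' ')[0])      # Chid1__Anouracaudifer_00017283-RA
print(without_gene == unannotated)  # True
```

The second call shows that records without the keyword pass through unchanged, so the function is safe to apply to every record in the file.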

This is the content of the modified_output_fasta_v1.fasta file:

> Chid1__Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACC
CTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTC
TCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG

Note: The double underscore in the output name occurs because the ID in your example already starts with an underscore (I'm not sure if that was intentional):

>_Anouracaudifer_00017283-RA ...
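
If the leading underscore is intentional and you want exactly the header from the question (a single underscore and no space after the >), a plain-Python variant is one option. This is only a sketch using the standard library rather than Biopython. The names rename_header and rename_fasta are made up for illustration, and it assumes the gene name is the first word after Similar to, ending at whitespace or a double quote.

```python
import re

# Captures the first word after "Similar to " (stops at whitespace or a quote).
SIMILAR_TO = re.compile(r'Similar to ([^\s"]+)')

def rename_header(header):
    """Return the header with the gene name inserted right after '>'."""
    match = SIMILAR_TO.search(header)
    if match is None:
        return header  # no annotation: leave the header untouched
    rest = header[1:]  # everything after the leading '>'
    # Add a separator only if the ID does not already start with '_'.
    sep = "" if rest.startswith("_") else "_"
    return ">" + match.group(1) + sep + rest

def rename_fasta(input_file, output_file):
    """Copy a FASTA file, rewriting only the header lines."""
    with open(input_file) as in_file, open(output_file, "w") as out_file:
        for line in in_file:
            if line.startswith(">"):
                line = rename_header(line)
            out_file.write(line)
```

Calling rename_fasta("all_chias.fasta", "modified_output_fasta_v1.fasta") should then give a header starting with >Chid1_Anouracaudifer_00017283-RA for the example record. Unlike the SeqIO version, this copies the sequence lines as they are, so they are not re-wrapped to 60 columns.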