(cross-posted to Biostars: https://www.biostars.org/p/9562110/)
I have an annotated FASTA file, and in each header I want to insert the word that follows 'Similar to' immediately after the >. For example, given this record:
>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG
I want the output to look like this:
>Chid1_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG
How can I do this? I already tried sed:
sed -E 's/(Similar to )(\w+)/>CHIA_\2\1\2/' file.txt > new_file_2.txt
I stored the result in a new file and tried to paste it into the headers, but it does not work. Any ideas?
I also tried a Python script:
def extract_similar_to_word(line):
    words = line.split()
    for i, word in enumerate(words):
        if word == "Similar":
            similar_to_word = words[i + 2].strip('""')
            if i + 3 < len(words) and words[i + 3].strip('""')[0].isupper():
                similar_to_word = words[i + 1].strip('""') + words[i + 2].strip('""')
            return similar_to_word
    return None

def modify_fasta_headers(input_file, output_file):
    with open(input_file, "r") as in_file, open(output_file, "w") as out_file:
        for line in in_file:
            if line.startswith(">"):
                similar_to_word = extract_similar_to_word(line)
                if similar_to_word:
                    # Find the first space in the line, then insert the similar_to_word
                    first_space_index = line.find(" ")
                    line = ">" + similar_to_word + "_" + line[1:first_space_index] + line[first_space_index:]
            out_file.write(line)

input_file = "all_chias.fasta"
output_file = "modified_output_fasta_v1.fasta"
modify_fasta_headers(input_file, output_file)
In Python, you can use Biopython's SeqIO to rename the records:
from Bio import SeqIO

def extract_similar_to(line):
    data = line.split('Similar to ')
    if len(data) > 1:
        return data[1].split(' ')[0] + "_" + line
    else:
        return line

def modify_fasta_headers(input_file, output_file):
    with open(output_file, "w") as outputs:
        for r in SeqIO.parse(input_file, "fasta"):
            # Rewrite description with similar to word
            r.description = extract_similar_to(r.description)
            # Remove old id
            r.id = ''
            SeqIO.write(r, outputs, "fasta")

input_file = "all_chias.fasta"
output_file = "modified_output_fasta_v1.fasta"
modify_fasta_headers(input_file, output_file)
The extract_similar_to() function was rewritten so that it splits the line on the keyword 'Similar to '. For the example that you gave:
>_Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACCCTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTCTCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG
The line data = line.split('Similar to ') would return the following:
['_Anouracaudifer_00017283-RA transcript Name:"', 'Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393']
Since 'Similar to' may not be present in every header, we check that the returned list has length > 1. If it does, we return the first word of the second list element, followed by an underscore and the original line. Otherwise, we return the original line unchanged.
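As a quick check of the logic just described, here is a small standalone sketch that runs extract_similar_to() on a header string. The header is a shortened copy of your example (trailing fields dropped for brevity), and the second input, 'seq1 transcript offset:0', is a made-up header used only to show the fallback branch.

```python
def extract_similar_to(line):
    # Split on the keyword; a list longer than 1 means the keyword was found.
    data = line.split('Similar to ')
    if len(data) > 1:
        # First word after the keyword, then "_" and the untouched header.
        return data[1].split(' ')[0] + "_" + line
    return line

# Shortened version of the example header, without the leading ">"
# (this is how Biopython exposes it in record.description).
header = ('_Anouracaudifer_00017283-RA transcript '
          'Name:"Similar to Chid1 Chitinase domain-containing protein 1 '
          '(Rattus norvegicus OX=10116)" offset:0 AED:0.30')

renamed = extract_similar_to(header)
# Hypothetical header with no "Similar to" keyword: returned as-is.
unchanged = extract_similar_to('seq1 transcript offset:0')

print(renamed.split(' ')[0])
print(unchanged)
```

The first print shows only the new first token of the header, so it is easy to see what was prefixed.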
This is the content of the modified_output_fasta_v1.fasta file:
> Chid1__Anouracaudifer_00017283-RA transcript Name:"Similar to Chid1 Chitinase domain-containing protein 1 (Rattus norvegicus OX=10116)" offset:0 AED:0.30 eAED:0.30 QI:0|0|0|1|1|1|12|0|393
ATGAAGGCGCTCCTGCATGTGCTCTGGCTCACTCTGGCCTGCGGCTCTGCTCACACCACC
CTGTCGAAGTCGGATGCCAAGAAGTCTGCCTCCAAGACACTGCAGGAGAAGACTCAGCTC
TCAGAGACACCTGTGCAGGACCGGGGTCTGGTGGTAACAGACCCCCGAGCCGAGGACG
Note: the double underscore appears in the output name because the ID in your example already starts with an underscore (I'm not sure if that was intentional):
>_Anouracaudifer_00017283-RA ...
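If the leading underscore is not wanted, one possible alternative is to rewrite the header lines directly with the standard library instead of Biopython. This is only a sketch under a few assumptions: every header starts with > at the beginning of the line, the wanted word is the first run of non-space, non-quote characters after 'Similar to ', and any leading underscores in the old ID can be dropped. The helper name rename_header and the SIMILAR_TO pattern are mine, not part of any library.

```python
import re

# First word after "Similar to " (stops at whitespace or a double quote).
SIMILAR_TO = re.compile(r'Similar to ([^\s"]+)')

def rename_header(line):
    """Prefix the 'Similar to' word to the ID of a FASTA header line."""
    match = SIMILAR_TO.search(line)
    if not line.startswith('>') or match is None:
        # Sequence line, or header without the keyword: leave untouched.
        return line
    # Drop ">" and any leading underscores so only one "_" joins the parts.
    rest = line[1:].lstrip('_')
    return '>' + match.group(1) + '_' + rest

def modify_fasta_headers(input_file, output_file):
    # Sequence lines are copied as-is, so their line wrapping is preserved.
    with open(input_file) as in_file, open(output_file, 'w') as out_file:
        for line in in_file:
            out_file.write(rename_header(line))

example = ('>_Anouracaudifer_00017283-RA transcript '
           'Name:"Similar to Chid1 Chitinase domain-containing protein 1 '
           '(Rattus norvegicus OX=10116)" offset:0\n')
print(rename_header(example), end='')
```

Because the lines are written back without going through a FASTA writer, there is no space after the > and the sequence is not re-wrapped. To process a file, call modify_fasta_headers("all_chias.fasta", "modified_output_fasta_v1.fasta") as before.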