I have a large FASTA file with various bacterial species names in each of the sequence headers. It looks something like this:
file.fasta
>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
What I would like to do is search each header for a single name, Streptomyces, and replace the entire header with just "Streptomyces" if it is listed, or otherwise replace the entire header with "Not Streptomyces":
new_file.fasta
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
My first instinct is to use something like awk or sed to do this replacement, but I run into trouble figuring out how to replace the entire string.
How should I go about this?
In any awk you can do:
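awk '/^>/{
    # default label, overridden if any ;-separated field is exactly Streptomyces
    s="Not Streptomyces"
    n=split($0,fields,";")
    for(i=1;i<=n;i++) if (fields[i]=="Streptomyces") s="Streptomyces"
    $0=">" s
} 1    # the trailing 1 prints every line, modified or not
' file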
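Or with GNU awk, using its word-boundary operators \< and \> (the underscore counts as a word character, so Streptomyces_sp._AA4 on its own does not match):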
gawk '/^>/ {
    if ($0~/\<Streptomyces\>/)
        $0=">Streptomyces"
    else
        $0=">Not Streptomyces"
}
1
' file
Or more tersely:
gawk '/^>/ { $0=">" ($0~/\<Streptomyces\>/ ? "" : "Not ") "Streptomyces" }1' file
Or, if you can rely on every header starting with >Bacteria; and ending in ; (as in your example), then you can do (in any awk):
awk '/^>/ { $0=">" ($0~/;Streptomyces;/ ? "" : "Not ") "Streptomyces" } 1' file
Ruby:
ruby -lpe 'if /^>/ then $_ = /\bStreptomyces\b/ ? ">Streptomyces" : ">Not Streptomyces" end' file
Any of those prints:
>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
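Since the question also mentions sed, here is a minimal sketch of the same idea in plain sed. It assumes, as in the sample, that every taxonomy field in a header ends in ; so that the name can be matched as a whole field with [>;]Streptomyces;. The file names file.fasta and new_file.fasta are the ones from the question, and the here-document only recreates the two sample records so the snippet can be run on its own:

```shell
# Recreate the two sample records from the question.
cat > file.fasta <<'EOF'
>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
EOF

# On header lines only: if a whole field equals Streptomyces, replace the
# line and branch to the end of the script (b); otherwise fall through to
# the second substitution. Sequence lines are printed unchanged.
sed '/^>/{
/[>;]Streptomyces;/{
s/.*/>Streptomyces/
b
}
s/.*/>Not Streptomyces/
}' file.fasta > new_file.fasta

cat new_file.fasta
```

The s/.*/.../ form is what replaces the entire line: .* matches the whole header, so whatever follows becomes the new header. If your headers do not always end each field with ; then the awk split version above is the safer choice.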