bashawksedbioinformaticsfasta

How should I go about implementing conditional string replacements in a fasta file?


I have a large fasta files with various bacterial species names in each of the sequence headers that looks something like this:

file.fasta

>Bacteria;Actinobacteria;Actinobacteria;Streptomyces;Streptomycetaceae;Streptomyces;Streptomyces_sp._AA4;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Bacteria;Actinobacteria;Actinobacteria;Pseudonocardiales;Pseudonocardiaceae;Amycolatopsis;Amycolatopsis_niigatensis;
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG

What I would like to do is search through each of the headers for a single species, Streptomyces, and replace the entire header with just "Streptomyces" if it is listed, else replace the entire header "Not Streptomyces":

new_file.fasta

>Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
>Not Streptomyces
TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG

My first instinct is to use something like awk or sed to do this replacement, but I run into trouble figuring out how to replace the entire string.

How should I go about this?


Solution

  • In any awk you can do:

    awk '/^>/{
            s="Not Streptomyces"
            n=split($0,fields,";")
            for(i=1;i<=n;i++) if (fields[i]=="Streptomyces") s="Streptomyces"
            $0=">" s
    } 1
    ' file
    

    Or with GNU awk for the word boundary regex:

    gawk '/^>/ { 
                if ($0~/\<Streptomyces\>/) 
                    $0="Streptomyces"
                else 
                    $0="Not Streptomyces"
                }
    1
    ' file
    

    Or more tersely:

    gawk '/^>/ { $0=">" ($0~/\<Streptomyces\>/ ? "" : "Not ") "Streptomyces" }1' file
    

    Or if you can rely that the start is always >Bacteria; and the line always ends in ; (as in your example) then you can do (in any awk):

    awk '/^>/ { $0=">" ($0~/;Streptomyces;/ ? "" : "Not ") "Streptomyces"  } 1' file
    

    Ruby:

    ruby -lpe 'if /^>/ then $_ = /\bStreptomyces\b/ ? ">Streptomyces" : ">Not Streptomyces" end' file 
    

    ANy of those prints:

    >Streptomyces
    TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG
    >Not Streptomyces
    TTGGCAGTCTCTCCCGCGAACCAGGCCACTGCTGCGACCACCTCGGCTGAATCCCGCGCGCAGGCCACGGGAATCCCCGG