bash sed fasta

fasta file: replace header with filename


I want to replace every header line (those starting with >) with >{filename} in all *.fasta files in my directory, and then concatenate the files.

Contents of my directory:

speciesA.fasta
speciesB.fasta
speciesC.fasta

Example file, speciesA.fasta:

>protein1 description
MJSUNDKFJSKFJSKFJ
>protein2 anothername
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>protein3 somewordshere
KSDAFJLASDJFKLAJFL

My desired output (shown only for speciesA.fasta):

>speciesA
MJSUNDKFJSKFJSKFJ
>speciesA
KEFJKSDJFKSDJFKSJFLSJDFLKSJF
>speciesA
KSDAFJLASDJFKLAJFL

This is my code:

for file in *.fasta; do var=$(basename $file .fasta) | sed 's/>.*/>$var/' $var.fasta >>$var.outfile.fasta; done

but all I get is

>$var
MJSUNDKFJSKFJSKFJ
>$var
KEFJKSDJFKSDJFKSJFLSJDFLKSJF

[and so on ...]

Where did I make a mistake?


Solution

  • The bash loop is superfluous. Try:

    awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
    

    This approach is safe even if the file names contain special or regex-active characters.

    How it works
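The one-liner consists of two awk rules. The sketch below spells them out with comments; it is the same program, only wrapped in a shell function whose name, `relabel_fasta`, is made up for illustration.

```shell
# Hypothetical wrapper (the name relabel_fasta is an assumption, not part of
# the answer); it runs the same awk program on the files it is given.
relabel_fasta() {
    awk '
        # Rule 1: header lines (those starting with ">").
        # FILENAME is the file awk is currently reading; dropping its
        # last 6 characters removes the ".fasta" extension.
        /^>/ {
            print ">" substr(FILENAME, 1, length(FILENAME) - 6)
            next        # skip rule 2 for this line
        }
        # Rule 2: the pattern "1" is always true, and a pattern with no
        # action prints the line unchanged (the sequence lines).
        1
    ' "$@"
}
```

To save the result, redirect it, for example `relabel_fasta *.fasta > combined.fa`. The output name here is a placeholder; choosing one that does not match `*.fasta` keeps a later run from reading the combined file as input.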

    Example

    Let's consider a directory with two (identical) test files:

    $ cat speciesA.fasta
    >protein1 description
    MJSUNDKFJSKFJSKFJ
    >protein2 anothername
    KEFJKSDJFKSDJFKSJFLSJDFLKSJF
    >protein3 somewordshere
    KSDAFJLASDJFKLAJFL
    $ cat speciesB.fasta
    >protein1 description
    MJSUNDKFJSKFJSKFJ
    >protein2 anothername
    KEFJKSDJFKSDJFKSJFLSJDFLKSJF
    >protein3 somewordshere
    KSDAFJLASDJFKLAJFL
    

    The output of our command is:

    $ awk '/^>/{print ">" substr(FILENAME,1,length(FILENAME)-6); next} 1' *.fasta
    >speciesA
    MJSUNDKFJSKFJSKFJ
    >speciesA
    KEFJKSDJFKSDJFKSJFLSJDFLKSJF
    >speciesA
    KSDAFJLASDJFKLAJFL
    >speciesB
    MJSUNDKFJSKFJSKFJ
    >speciesB
    KEFJKSDJFKSDJFKSJFLSJDFLKSJF
    >speciesB
    KSDAFJLASDJFKLAJFL
    

    In the output, every header has been replaced by its file's name and all the input files are concatenated.
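As for the mistake in the question's loop, two things go wrong. Single quotes stop the shell from expanding `$var`, so sed receives the literal text `$var`. In addition, the `|` after the assignment turns it into a pipeline stage that runs in a subshell, so `var` is never set in the shell that runs sed. A minimal corrected sketch of the loop follows, assuming the file names contain none of the characters that are special in a sed replacement (`/`, `&`, `\`); the function name `concat_relabelled` is hypothetical.

```shell
# Hypothetical helper name; writes the relabelled, concatenated records to stdout.
concat_relabelled() {
    for file in "$@"; do
        name=$(basename "$file" .fasta)   # plain assignment, no pipe
        # Double quotes let the shell expand $name before sed sees the script.
        sed "s/^>.*/>$name/" "$file"
    done
}

# Usage (all_species.fa is a placeholder output name that *.fasta will not match):
# concat_relabelled *.fasta > all_species.fa
```

Unlike the awk version, this interpolates the file name into the sed script, which is why the restriction on special characters applies.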