bashawkfasta

Sequence length of FASTA file


I have the following FASTA file:

>header1
CGCTCTCTCCATCTCTCTACCCTCTCCCTCTCTCTCGGATAGCTAGCTCTTCTTCCTCCT
TCCTCCGTTTGGATCAGACGAGAGGGTATGTAGTGGTGCACCACGAGTTGGTGAAGC
>header2
GGT
>header3
TTATGAT

My desired output:

>header1
117
>header2
3
>header3
7
# 3 sequences, total length 127.

This is my code:

awk '/^>/ {print; next; } { seqlen = length($0); print seqlen}' file.fa

The output I get with this code is:

>header1
60
57
>header2
3
>header3
7

I need a small modification in order to deal with multiple sequence lines.

I also need a way to have the total sequences and total length. Any suggestion will be welcome... In bash or awk, please. I know that is easy to do it in Perl/BioPerl and actually, I have a script to do it in those ways.


Solution

  • An awk / gawk solution can be composed by three stages:

    1. Every time header is found these actions should be performed:

      • Print previous seqlen if exists.
      • Print tag.
      • Initialize seqlen.
    2. For the sequence lines we just need to accumulate totals.
    3. Finally at the END stage we print the remnant seqlen.

    Commented code:

    awk '/^>/ { # header pattern detected
            if (seqlen){
             # print previous seqlen if exists 
             print seqlen
             }
    
             # pring the tag 
             print
    
             # initialize sequence
             seqlen = 0
    
             # skip further processing
             next
          }
    
    # accumulate sequence length
    {
    seqlen += length($0)
    }
    # remnant seqlen if exists
    END{if(seqlen){print seqlen}}' file.fa
    

    A oneliner:

    awk '/^>/ {if (seqlen){print seqlen}; print ;seqlen=0;next; } { seqlen += length($0)}END{print seqlen}' file.fa
    

    For the totals:

    awk '/^>/ { if (seqlen) {
                  print seqlen
                  }
                print
    
                seqtotal+=seqlen
                seqlen=0
                seq+=1
                next
                }
        {
        seqlen += length($0)
        }     
        END{print seqlen
            print seq" sequences, total length " seqtotal+seqlen
        }' file.fa