awk, bioinformatics, fasta

Count the number of lines and characters in a file separated by a specific character


I have a fasta file:

>1
AGGGTCACGTAATGCTGATCCAGTCTTGTTTTTATTTTCATTCATGTTCCCGCTCTTGCT
TTGATTCCGACTTCTAACGTTTAACCTGTGATCAGACGTTTCACTGCTCCATATTTTACG
TGTGCCTGCCGGTCATCTTGGGTAGAGTTAGCATATCC
>2
GTTTGGAAAACCTTGAGAACTTGGCTGAGCAACTAGGAGATAGGCGTATAAAGACTATCG
GCTTTGTTCTCGAAAAAATTCAATCAATTTTCGAGCATTCTTATCGCAGAATTGTTGAAT
>3
ACTCATG

The actual number of lines following each ">" can run into the thousands or even millions. In this example, there are 3 lines (158 letters) following the >1, 2 lines and 120 characters after the >2, and 1 line of 7 characters after the >3.

I would like to have output to be something like:

>1
3 158
>2
2 120
>3
1 7

(The format isn't critical as long as both pieces of information, the number of lines and the number of characters, are there.)

I have been using a Python script to split these files at each ">" and then count the number of lines and characters in between. However, the files are very large and the Python script takes a long time to run. Is there a simple way to do this with awk or something else on the Linux command line?


Solution

  • Here's another awk idea that handles empty sequences, empty lines, and comment lines in FASTA files:

    awk '
        # skip old-style FASTA comment lines, which start with ";"
        /^;/ { next }
        /^>/ {
            # seq holds the previous record: a sentinel "\n" followed by
            # one chunk + "\n" per input line. gsub() strips the newlines
            # in place and returns how many there were, so the line count
            # is gsub()-1 (minus the sentinel) and length(seq) is then
            # the number of sequence characters left over.
            if (substr(seq,1,1) == "\n")
                print gsub(/\n/,"",seq)-1, length(seq);
            seq = "\n";     # sentinel: marks that a header has been seen
            print;          # echo the ">" header line
            next
        }
        { seq = seq $0 "\n" }
        END {
            # flush the counts of the last record
            if (substr(seq,1,1) == "\n")
                print gsub(/\n/,"",seq)-1, length(seq);
        }
    ' file.fasta
    
    >1
    3 158
    >2
    2 120
    >3
    1 7
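
    Since the whole sequence is buffered in seq, memory use grows with the size of each record. If your files contain no ";" comment lines (an assumption about the data, not something the answer above needs), a streaming variant that keeps only two running counters per record works in constant memory even for million-line sequences; a minimal sketch:

    awk '
        /^>/ {
            # print the counts of the previous record, if any
            if (header) print lines, chars
            print                         # echo the ">" header line
            header = 1; lines = 0; chars = 0
            next
        }
        { lines++; chars += length($0) }  # accumulate per-record totals
        END { if (header) print lines, chars }
    ' file.fasta

    This prints the same output for the sample file above, and an empty record still prints "0 0".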