I have a fasta file:
>1
AGGGTCACGTAATGCTGATCCAGTCTTGTTTTTATTTTCATTCATGTTCCCGCTCTTGCT
TTGATTCCGACTTCTAACGTTTAACCTGTGATCAGACGTTTCACTGCTCCATATTTTACG
TGTGCCTGCCGGTCATCTTGGGTAGAGTTAGCATATCC
>2
GTTTGGAAAACCTTGAGAACTTGGCTGAGCAACTAGGAGATAGGCGTATAAAGACTATCG
GCTTTGTTCTCGAAAAAATTCAATCAATTTTCGAGCATTCTTATCGCAGAATTGTTGAAT
>3
ACTCATG
The actual number of lines following each ">" can be in the thousands or even millions. In this example, there are 158 letters (on 3 lines) after the >1, 120 characters on 2 lines after the >2, and 7 characters on 1 line after the >3.
I would like to have output to be something like:
>1
3 158
>2
2 120
>3
1 7
(The format isn't critical, as long as both pieces of information, the number of lines and the number of characters, are there.)
I have been using a Python script to split these files by the > and then count the number of lines and characters between each >. However, the files are very large and the Python script takes a long time to run. Is there a simple way to do this using awk or something else on the Linux command line?
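For reference, a streaming rewrite of the kind of Python approach described above (the original script isn't shown, so this is a sketch; it assumes plain records with no ";" comment lines) avoids splitting the whole file in memory by keeping only running counters per record:

```python
import io

def fasta_counts(handle):
    """Yield (header, line_count, char_count) for each FASTA record,
    streaming line by line so memory use stays constant."""
    header, lines, chars = None, 0, 0
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            if header is not None:       # report the previous record
                yield header, lines, chars
            header, lines, chars = line, 0, 0
        elif header is not None:         # a sequence line: update counters
            lines += 1
            chars += len(line)
    if header is not None:               # report the last record
        yield header, lines, chars

# Small usage example on an in-memory file-like object:
example = ">1\nAGGT\nCC\n>2\nACTCATG\n"
for header, lines, chars in fasta_counts(io.StringIO(example)):
    print(header)
    print(lines, chars)
```

Even so, a compiled tool like awk reading the same stream will usually be faster on multi-gigabyte files.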
Here's another awk idea that handles empty sequences, empty lines and comment lines in FASTA files:
awk '
  /^;/ { next }                      # skip ";" comment lines
  /^>/ {                             # new header: report the previous record
      if (substr(seq, 1, 1) == "\n")
          # gsub() strips the newlines from seq and returns how many it
          # removed: one more than the number of sequence lines
          print gsub(/\n/, "", seq) - 1, length(seq)
      seq = "\n"                     # mark that a header has been seen
      print                          # print the header itself
      next
  }
  { seq = seq $0 "\n" }              # accumulate sequence lines
  END {                              # report the last record
      if (substr(seq, 1, 1) == "\n")
          print gsub(/\n/, "", seq) - 1, length(seq)
  }
' file.fasta
Output:
>1
3 158
>2
2 120
>3
1 7