shellawkfasta

Count reads in fasta file (loop)


I would like to count the number of reads in a fasta file using awk. A read in a fasta file starts with "> NAME" followed with a line and "DNA code". In my case, fasta files contain at least 1 read (thus not empty). I also have a lot of fasta files and I would like to loop this. Therefore I would also like to copy the name of the file into the output and paste the number of reads next to the name of the file in the output file.

Fasta file example (File 1.fasta)

>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
>sequence B
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg

Fasta file example (File 2.fasta)

>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
>sequence B
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg
>sequence C
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg

I've tried multiple scripts already

Script 1

#!/bin/bash
for file1 in ~/Test/*.fasta
do
outfile1=${file1}_readcount.txt
awk -F ' ' -v out1=$outfile1 '{
if(NR==1) {lines[0]=lines[0] OFS FILENAME  > out1;
}
if(FNR==NR) {grep -c "^>" $file1 > out1;
}
}'  $file1
done

It didn't give an error but also no output

Script 2

awk '
BEGIN   { OFS="\t" } #output field delimiter is tab

FNR==1  { lines[0]=lines[0] OFS FILENAME } #append filename to header record

FNR==NR {grep -c "^>" FILENAME } # counts number of ">" at the beginning of lines

END     { for (i=0;i<=FNR;i++) #loop through the line numbers
              print lines[i]
        } #printing each line 
' *fasta > countreads.txt

Here I only got the header in the file and thousands of empty rows.

Expected ouput that I would like to get

File1   2
File2   3

Solution

  • If you mean to count each /^>/ as a sequence and then summarize sequence counts by filenames, you can do this in awk:

    awk 'FNR==1{if (name) print name, cnt; name=FILENAME; cnt=0}
    /^>/{cnt++}
    END{print name, cnt}' *.fasta
    

    This also works but the file names may be in a different order than awk read them:

    awk '/^>/{file_cnt[FILENAME]++}
    END{for (fn in file_cnt) print fn, file_cnt[fn]}' *.fasta
    

    You can also just use grep directly to count the matches:

    grep -c "^>" *.fasta # that will be ':' delimited. 
                         # You can use sed, tr, awk to 
                         # change the ':' to a '\t` or whatever.
    

    In summary, use awk if you want to customize what is counted or printed. Use grep with a glob if you just want to count the total regex matches and sumarize by file name.