I would like to count the number of reads in a fasta file using awk. A read in a fasta file starts with "> NAME" followed with a line and "DNA code". In my case, fasta files contain at least 1 read (thus not empty). I also have a lot of fasta files and I would like to loop this. Therefore I would also like to copy the name of the file into the output and paste the number of reads next to the name of the file in the output file.
Fasta file example (File 1.fasta)
>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
>sequence B
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg
Fasta file example (File 2.fasta)
>sequence A
ggtaagtcctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagaaaaaacacttttaggctatattagagccatcttctttgaagcgttgtc
>sequence B
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg
>sequence C
ggtaagtgctctagtacaaacacccccaatattgtgatataattaaaattatattcatat
tctgttgccagattttacacttttaggctatattagagccatcttctttgaagcgttgtc
tatgcatcgatcgacgactg
I've tried multiple scripts already
Script 1
#!/bin/bash
for file1 in ~/Test/*.fasta
do
outfile1=${file1}_readcount.txt
awk -F ' ' -v out1=$outfile1 '{
if(NR==1) {lines[0]=lines[0] OFS FILENAME > out1;
}
if(FNR==NR) {grep -c "^>" $file1 > out1;
}
}' $file1
done
It didn't give an error but also no output
Script 2
awk '
BEGIN { OFS="\t" } #output field delimiter is tab
FNR==1 { lines[0]=lines[0] OFS FILENAME } #append filename to header record
FNR==NR {grep -c "^>" FILENAME } # counts number of ">" at the beginning of lines
END { for (i=0;i<=FNR;i++) #loop through the line numbers
print lines[i]
} #printing each line
' *fasta > countreads.txt
Here I only got the header in the file and thousands of empty rows.
Expected ouput that I would like to get
File1 2
File2 3
If you mean to count each /^>/
as a sequence and then summarize sequence counts by filenames, you can do this in awk:
awk 'FNR==1{if (name) print name, cnt; name=FILENAME; cnt=0}
/^>/{cnt++}
END{print name, cnt}' *.fasta
This also works but the file names may be in a different order than awk read them:
awk '/^>/{file_cnt[FILENAME]++}
END{for (fn in file_cnt) print fn, file_cnt[fn]}' *.fasta
You can also just use grep
directly to count the matches:
grep -c "^>" *.fasta # that will be ':' delimited.
# You can use sed, tr, awk to
# change the ':' to a '\t` or whatever.
In summary, use awk
if you want to customize what is counted or printed. Use grep
with a glob if you just want to count the total regex matches and sumarize by file name.