I have data that always comes in blocks of four lines, in the following format (called FASTQ):
@SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
+SRR018006.2016 GA2:6:1:20:650 length=36
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!+!
@SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
+SRR018006.19405469 GA2:6:100:1793:611 length=36
7);;).;);;/;*.2>/@@7;@77<..;)58)5/>/
Is there a simple sed/awk/bash way to convert it into this format (called FASTA):
>SRR018006.2016 GA2:6:1:20:650 length=36
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGN
>SRR018006.19405469 GA2:6:100:1793:611 length=36
ACCCGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
In principle, we want to extract the first two lines of each block of four and replace @ with >.
This is an old question, and many different solutions have been offered. The accepted answer uses sed but has a glaring problem: it also replaces @ with > whenever the @ sign appears as the first character of a quality line. So I feel compelled to offer a simple sed-based solution that actually works:
sed -n '1~4s/^@/>/p;2~4p'
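Here 1~4 matches every fourth line starting at line 1 (the headers) and 2~4 matches every fourth line starting at line 2 (the sequences); -n suppresses all other output. The only assumption made is that each read occupies exactly 4 lines in the FASTQ file, which seems pretty safe in my experience.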
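One caveat: the first~step address form (1~4, 2~4) is a GNU sed extension, so the one-liner may not work with other sed implementations. As a rough sketch of the same line-position idea in awk, under the same assumption of exactly 4 lines per read, something like the following should behave equivalently. The function name fastq_to_fasta_awk and the file names in the comment are made up for illustration.

```shell
# Hypothetical helper: keep lines 1 and 2 of every 4-line block,
# rewriting the leading @ of the header line to >.
# Assumes each read occupies exactly 4 lines.
fastq_to_fasta_awk() {
    awk 'NR % 4 == 1 { sub(/^@/, ">"); print }   # header line
         NR % 4 == 2 { print }                   # sequence line
        ' "$@"
}

# Example call (file names are placeholders):
#   fastq_to_fasta_awk reads.fastq > reads.fasta
```

Because the lines are selected by position rather than by content, a quality line that happens to start with @ is never touched.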
The fastq_to_fasta script in the FASTX-Toolkit also works. It's worth mentioning that you need to specify the -Q33 option to accommodate the now-common Phred+33 quality encoding, which is funny, since it's throwing away the quality data anyway!