uniqfastqids

How to extract unique read IDs from a fastq file?


I want to extract all the unique read IDs in a FASTQ file and write them to a text file. (I have done the same for BAM files using samtools, but I don't know of any tool that handles FASTQ files.)

for BAM files: samtools view input.bam | cut -f1 | sort | uniq >> unique.reads.txt

for fastq: (need help)

Looking for a one-liner command or a tool that can do that.

Thank you.


Solution

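  • using seqkit (no need to sort). The idea is:

    1. convert the FASTQ to tab-separated format, one record per line (header, sequence, quality)
    2. store $1 (the read ID) as a key of an awk array, then loop over the keys at the end and print each ID once to the output file; the output order of the awk loop is not guaranteed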

    seqkit fx2tab reads.fq | awk -v OFS='\t' '{array[$1]=1} END {for (readID in array) print readID}' > unique.reads.txt

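    you can also do this:

    seqkit fx2tab reads.fq | cut -f 1 | sort | uniq > unique.reads.txt

    but here the sort is required, because uniq only removes adjacent duplicate lines. Also note that cut -f 1 keeps the whole first tab-separated column, so if the header carries a description after the ID, it is kept too, whereas awk's $1 stops at the first whitespace.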

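    or almost the same but without seqkit. Don't use grep "@" for this: "@" is also a valid quality character, so it can appear in (or even start) a quality line. Assuming the usual four lines per record, pick the header lines by position and strip the leading "@":

    awk 'NR % 4 == 1 {print substr($1, 2)}' reads.fq | sort | uniq > unique.reads.txt

    awk 'NR % 4 == 1 {array[substr($1, 2)]=1} END {for (readID in array) print readID}' reads.fq > unique.reads.txt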

    In general, though, I like seqkit and always recommend it.
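
A minimal sketch of a seqkit-free variant, assuming an uncompressed FASTQ with exactly four lines per record. It selects header lines by position (every fourth line, starting from the first) rather than by matching "@", since "@" can also occur in quality strings, and it prints IDs in order of first appearance. The function name unique_fastq_ids is hypothetical, chosen only for illustration.

```shell
# Hypothetical helper: print each read ID once, in order of first appearance.
# Assumption: uncompressed FASTQ, exactly four lines per record.
unique_fastq_ids() {
    awk 'NR % 4 == 1 {
        id = substr($1, 2)        # drop the leading "@"; $1 ends at the first whitespace
        if (!(id in seen)) {      # only print an ID the first time it is seen
            seen[id] = 1
            print id
        }
    }' "$1"
}
```

Usage would look like: unique_fastq_ids reads.fq > unique.reads.txt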