fasta

Extract UNIQUE fasta sequences from a text file


I have a TXT file:

HISEQ1:105:C0A57ACXX:2:1101:10000:105587/1 
HISEQ1:105:C0A57ACXX:2:1101:10000:105587/2 
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/1
HISEQ1:105:C0A57ACXX:2:1101:10000:121322/2 
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/1 
HISEQ1:105:C0A57ACXX:2:1101:10000:12798/2

and a fasta file with sequences:

>HISEQ1:105:C0A57ACXX:2:1101:10000:105587/1 
GCACCCTCGGGGGAGCAACGAAGAGGTAGACGAAGGCGATCGCAGCCACCTGCGGCAGTATCCCCAGGAGGTCAAGGTCCTCCTCCCCGCTCACCGTCGCC
>HISEQ1:105:C0A57ACXX:2:1101:10000:105587/2
TTGGTGGCAGGCAACAGCTTTGGACGGCCACCGCCTCATGGCGCCTCCTCCCCGCTGCGTCCTCGCCGCGTCCCTCCCTGCTTCAAGC
>HISEQ1:85:D0C0FABXX:5:1101:1385:36009/1
TTTAGTTCCAGGCCGGCTGAAGACTGTCTTTACAAAAGAAAAGTTTAGCCTAGGAAGCCCATTGTTGTAGGTGTTGTAGTTTTATAGATGTACTTTGGAAA
>HISEQ1:85:D0C0FABXX:5:1101:1385:36009/2
CAGCCAAGTTCGCAGTCTCGATAGTATTGTTTTCATACAGCAGTCTTGACAAACCAAAGTCCGCAACTTTTGGTTCCAGATTATCATCTAGCAATATGTTT
>HISEQ1:105:C0A57ACXX:2:1101:10000:105587/2
TTGGTGGCAGGCAACAGCTTTGGACGGCCACCGCCTCATGGCGCCTCCTCCCCGCTGCGTCCTCGCCGCGTCCCTCCCTGCTTCAAGC

I would like to extract the sequences for these IDs from the fasta file, each only once, and get this output:

>HISEQ1:105:C0A57ACXX:2:1101:10000:105587/1 
GCACCCTCGGGGGAGCAACGAAGAGGTAGACGAAGGCGATCGCAGCCACCTGCGGCAGTATCCCCAGGAGGTCAAGGTCCTCCTCCCCGCTCACCGTCGCC
>HISEQ1:105:C0A57ACXX:2:1101:10000:105587/2
TTGGTGGCAGGCAACAGCTTTGGACGGCCACCGCCTCATGGCGCCTCCTCCCCGCTGCGTCCTCGCCGCGTCCCTCCCTGCTTCAAGC

but I also get duplicates. I tried these commands:

seqkit grep -f in.txt in.fa > out.fa 
seqtk subseq in.fa in.txt > out.fa

How can I modify the command lines above to get unique sequences?


Solution

  • Try piping the seqkit grep output through seqkit rmdup, where -n removes duplicates by full sequence name:

    seqkit grep -f in.txt in.fa | seqkit rmdup -n -o out.fa
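
If seqkit is not available, the same filter-then-deduplicate idea can be sketched in plain Python. This is only a minimal illustration, not what seqkit does internally: the function names read_fasta and extract_unique are made up for this example, and it assumes the ID is the first whitespace-delimited word of each header and that the first occurrence of a duplicated record is the one to keep.

```python
def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines.

    Assumption: the record ID is the first whitespace-delimited word
    after '>' (trailing spaces on the header are ignored).
    """
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            words = line[1:].split()
            record_id, chunks = (words[0] if words else ""), []
        elif record_id is not None and line:
            # Sequence may be wrapped over several lines; join them.
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def extract_unique(id_lines, fasta_lines):
    """Yield records whose ID is listed in id_lines, each ID at most once."""
    wanted = {line.strip() for line in id_lines if line.strip()}
    seen = set()
    for record_id, seq in read_fasta(fasta_lines):
        if record_id in wanted and record_id not in seen:
            seen.add(record_id)  # later duplicates of this ID are skipped
            yield record_id, seq


# Small in-memory demo with shortened, made-up records.
demo_ids = ["read1/1 ", "read1/2", "read9/1"]
demo_fasta = [
    ">read1/1 ", "GCAC",
    ">read1/2", "TTGG",
    ">read5/1", "TTTA",
    ">read1/2", "TTGG",
]
for record_id, seq in extract_unique(demo_ids, demo_fasta):
    print(f">{record_id}\n{seq}")
```

To use it on real files, open in.txt and in.fa, pass the two file objects to extract_unique, and write each returned pair to out.fa in the same ">ID" / sequence layout.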