I want to use the fastacmd
to extract specific regions of fasta sequences.
To do that I need to put the name of the fasta file -d
, the name of the sequence -s
and the position of the sequence to extract -L
. For example:
fastacmd -d OAP11402.1.fa -s OAP11402.1 -L 50,100
But the problem is that I have hundreds of files (each file has one sequence with the same name of the file) and the info of position of each sequence to extract is in a protein database (info_sequences.txt
). So, I want to make a loop to paste the name of the file, sequence and the positions to extract from the protein database info_sequences.txt
in the fastacmd
.
The look of info_sequences.txt
is like this:
File seq_id position_start position_end
OAP11402.1.fa OAP11402.1 50 100
OAP15774.1.fa OAP15774.1 75 200
OAP10214.1.fa OAP10214.1 33 310
I think that awk
could help but i'm struggling with the way to paste the info in the fastcmd
awk 'NR>1 {print "-d",$1,"-s",$2,"-L",$3","$4}' info_sequences.txt | xargs -I {} echo fastacmd {}
Output:
fastacmd -d OAP11402.1.fa -s OAP11402.1 -L 50,100 fastacmd -d OAP15774.1.fa -s OAP15774.1 -L 75,200 fastacmd -d OAP10214.1.fa -s OAP10214.1 -L 33,310
If everything looks okay, remove echo
.