I downloaded a hit table from NCBI BLAST (nucleotide BLAST against the nucleotide collection database, using the megablast program), then used awk to drop lines without tab-separated fields and sort to order the rest by accession version identifier:
awk -F "\t" 'NF>1{print}' unsorted_input.txt | sort -k2 > sorted_output.txt
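For reference, here is a minimal sketch of the same filter-and-sort step on made-up rows. The file names `demo_unsorted.txt` and `demo_sorted.txt` and the rows themselves are placeholders, not part of my data. Note that `sort -k2` keys on everything from the second field to the end of the line and splits on any whitespace; `-t` with a tab plus `-k2,2` restricts the key to the accession column alone:

```shell
# A literal tab character, built portably for use as the field separator.
tab=$(printf '\t')

# One comment line plus two made-up, tab-delimited hit-table rows (placeholder data).
printf '# comment line\nq1\tAB850654.1\t88.262\nq2\tAB002282.1\t91.217\n' > demo_unsorted.txt

# Keep only lines with more than one tab-separated field,
# then sort on the second column only.
awk -F '\t' 'NF>1' demo_unsorted.txt | sort -t "$tab" -k2,2 > demo_sorted.txt

cat demo_sorted.txt
```

The comment line has a single field, so awk drops it, and the two remaining rows come out ordered by accession.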
I then passed the accession version identifiers to Entrez Direct to extract the subject organism of each alignment:
awk -F "\t" 'NF>1{print $2}' unsorted_input.txt | epost -db nucleotide | efetch -format docsum | xtract -pattern DocumentSummary -element Organism | sort | paste sorted_output.txt - > final_output.txt
This command extracted the subject organism for some alignments but not all. I noticed that for the alignments where epost failed, querying each one individually with esearch did work:
esearch -db nucleotide -query "accession_version_identifier" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism
So, I attempted to use this approach in a loop, using the accession version identifier (second column) of each line to extract the subject organism name as follows:
while IFS=$'\t' read -r -a myArray
do
esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism > "output.txt"
done < input.txt
However, this only returned the subject organism of the first row. How can I apply this to every row, storing all subject organisms in the same file?
The first few lines of the input file can be found below. It is tab delimited:
ce1e013e-c4c5-47f9-b041-521ee293c4f0 AB002282.1 91.217 649 24 22 41 676 8 636 0.0 854
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 17668 17781 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 20740 20853 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 23812 23925 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 26884 26997 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 84.615 117 9 6 118 228 29956 30069 5.16e-19 108
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 85.345 116 9 6 118 228 33027 33139 1.11e-20 113
c10d7882-cc00-4ee2-8643-9b27fef66e83 AB828191.1 87.000 100 7 5 132 228 14613 14709 5.16e-19 108
8e8ac3f3-63f6-4519-ad25-287a25169f87 AB850654.1 88.262 4660 175 260 16 4401 103840 108401 0.0 5232
c4233926-9f23-46c4-bc4d-5702f47885bd AB850654.1 89.958 4272 119 235 1 4042 104203 108394 0.0 5227
876d8f20-9d36-4207-8754-0924d99a6c46 AC019188.6 91.855 221 4 7 3 210 78509 78290 1.39e-75 296
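I have fixed the problem by piping `echo` into esearch, so that it no longer reads the rest of input.txt from the loop's standard input, and by appending to output.txt with `>>` instead of overwriting it with `>`: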
while IFS=$'\t' read -r -a myArray
do
echo | esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Title,Organism >> output.txt
done < input.txt
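The effect of the `echo |` can be reproduced without Entrez Direct. In the sketch below, `cat > /dev/null` is a stand-in (an assumption for illustration, not an EDirect command) for any tool that reads standard input inside the loop, and `demo_input.txt` is a placeholder file. Without its own input, the stand-in swallows the remaining lines during the first iteration. Redirecting its stdin from `/dev/null`, which plays the same role as `echo |`, lets the loop reach every row:

```shell
# A literal tab character, used as the field separator for read.
tab=$(printf '\t')

# Three made-up, tab-delimited rows (placeholder data).
printf 'q1\tAB002282.1\nq2\tAB828191.1\nq3\tAB850654.1\n' > demo_input.txt

# Broken: the stand-in inherits the loop's stdin and drains the rest of the file.
: > demo_broken.txt
while IFS="$tab" read -r query accession
do
    cat > /dev/null
    echo "$accession" >> demo_broken.txt
done < demo_input.txt

# Fixed: the stand-in gets its own (empty) stdin, so the loop input is untouched.
: > demo_fixed.txt
while IFS="$tab" read -r query accession
do
    cat < /dev/null > /dev/null
    echo "$accession" >> demo_fixed.txt
done < demo_input.txt

# Number of rows each loop actually processed.
wc -l < demo_broken.txt
wc -l < demo_fixed.txt
```

The first loop processes one row and the second processes all three. The `>>` matters for a similar reason: with `>` the output file is truncated on every iteration, so only the most recent result would survive.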