bioinformatics dna-sequence ncbi

Can I use Entrez Direct to query multiple nucleotide accession version identifiers against a database without using epost?


I downloaded a hit table from NCBI BLAST (nucleotide BLAST against the nucleotide collection database, using the megablast program), then used awk to drop the non-tabular lines and sort to order the rows by accession.version identifier (second column):

awk -F "\t" 'NF>1{print}' unsorted_input.txt | sort -k2 > sorted_output.txt

I then passed the accession.version identifiers to Entrez Direct to extract the subject organism of each alignment:

awk -F "\t" 'NF>1{print $2}' unsorted_input.txt | epost -db nucleotide | efetch -format docsum | xtract -pattern DocumentSummary -element Organism | sort | paste sorted_output.txt - > final_output.txt

This command extracted the subject organism for some alignments but not all. For the accessions where the epost pipeline returned nothing, querying them individually with esearch did work:

esearch -db nucleotide -query "accession_version_identifier" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism

So I tried running this in a loop, using the accession.version identifier (second column) of each line to look up the subject organism:

while IFS=$'\t' read -r -a myArray
do
 esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Organism > "output.txt"
done < input.txt

However, this only returned the subject organism of the first row. How can I apply this to every row, storing all subject organisms in the same file?

The first few lines of the input file can be found below. It is tab delimited:

ce1e013e-c4c5-47f9-b041-521ee293c4f0    AB002282.1  91.217  649 24  22  41  676 8   636 0.0 854
c10d7882-cc00-4ee2-8643-9b27fef66e83    AB828191.1  84.615  117 9   6   118 228 17668   17781   5.16e-19    108
c10d7882-cc00-4ee2-8643-9b27fef66e83    AB828191.1  84.615  117 9   6   118 228 20740   20853   5.16e-19    108
c10d7882-cc00-4ee2-8643-9b27fef66e83    AB828191.1  84.615  117 9   6   118 228 23812   23925   5.16e-19    108
c10d7882-cc00-4ee2-8643-9b27fef66e83    AB828191.1  84.615  117 9   6   118 228 26884   26997   5.16e-19    108
c10d7882-cc00-4ee2-8643-9b27fef66e83    AB828191.1  84.615  117 9   6   118 228 29956   30069   5.16e-19    108
c10d7882-cc00-4ee2-8643-9b27fef66e83    AB828191.1  85.345  116 9   6   118 228 33027   33139   1.11e-20    113
c10d7882-cc00-4ee2-8643-9b27fef66e83    AB828191.1  87.000  100 7   5   132 228 14613   14709   5.16e-19    108
8e8ac3f3-63f6-4519-ad25-287a25169f87    AB850654.1  88.262  4660    175 260 16  4401    103840  108401  0.0 5232
c4233926-9f23-46c4-bc4d-5702f47885bd    AB850654.1  89.958  4272    119 235 1   4042    104203  108394  0.0 5227
876d8f20-9d36-4207-8754-0924d99a6c46    AC019188.6  91.855  221 4   7   3   210 78509   78290   1.39e-75    296

Solution

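  • I have fixed the problem. Piping echo into esearch gives it its own standard input, so it no longer reads the rest of input.txt that the loop is iterating over, and >> appends each result to output.txt instead of overwriting the file on every iteration. This version also prints the Title alongside the Organism:

    while IFS=$'\t' read -r -a myArray
    do
     echo | esearch -db nucleotide -query "${myArray[1]}" | efetch -format docsum | xtract -pattern DocumentSummary -element Title,Organism >> output.txt
    done < input.txt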
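    A possible variation on the loop above, offered as a sketch rather than a tested replacement: it redirects esearch's standard input from /dev/null (assumed here to have the same effect as the echo pipe), appends the organism to each original row so nothing has to be pasted together afterwards, and reuses the previous lookup when consecutive rows share an accession, which is the case for a table sorted on column 2. The function names fetch_organism and annotate_hits are made up for this example.

```shell
# Look up the organism for one accession.version with EDirect.
# </dev/null is assumed to keep esearch from reading the caller's stdin.
fetch_organism() {
  esearch -db nucleotide -query "$1" </dev/null |
    efetch -format docsum |
    xtract -pattern DocumentSummary -element Organism
}

# Print every row of a tab-delimited hit table with the organism appended
# as an extra last column. Consecutive rows with the same accession reuse
# the previous result, so a table sorted on column 2 triggers one query
# per distinct accession.
annotate_hits() {
  prev_acc=""
  prev_org=""
  while IFS= read -r line; do
    # Column 2 holds the accession.version; -s skips lines without tabs.
    acc=$(printf '%s\n' "$line" | cut -s -f2)
    [ -n "$acc" ] || continue
    if [ "$acc" != "$prev_acc" ]; then
      prev_org=$(fetch_organism "$acc")
      prev_acc=$acc
    fi
    printf '%s\t%s\n' "$line" "$prev_org"
  done < "$1"
}
```

    Usage would look like: annotate_hits sorted_output.txt > final_output.txt. Because each organism is written next to the row it was looked up for, the result does not depend on the two lists happening to sort into the same order.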