I am trying to download sequence data from E. coli samples within the state of Washington - it's about 1283 sequences, which I know is a lot. The problem that I am running into is that entrez_search and/or entrez_fetch seem to be pulling the wrong data. For example, the following R code does pull 1283 IDs, but when I use entrez_fetch on those IDs, the sequence data I get is from chickens and corn and things that are not E. coli:
search <- entrez_search(db = "biosample",
term = "Escherichia coli[Organism] AND geo_loc_name=USA:WA[attr]",
retmax = 9999, use_history = T)
Similarly, I tried pulling the sequence from one sample manually as a test. When I search for the accession number SAMN30954130 on the NCBI website, I see metadata for an E. coli sample. When I use this code, I see metadata for a chicken:
search <- entrez_search(db = "biosample",
term = "SAMN30954130[ACCN]",
retmax = 9999, use_history = T)
fetch_test <- entrez_fetch(db = "nucleotide",
id = search$ids,
rettype = "xml")
fetch_list <- xmlToList(fetch_test)
The issue here is that you are using a Biosample UID to query the Nucleotide database. However, the UID is then interpreted as a Nucleotide UID, so you get a sequence record unrelated to your original Biosample query.
What you need to use in this situation is entrez_link
, which uses a UID to link records between two databases.
For example, your Biosample accession SAMN30954130 has the Biosample UID 30954130. You link that to Nucleotide like this:
nuc_links <- entrez_link(dbfrom='biosample', id=30954130, db='nuccore')
And you can get the corresponding Nucleotide UID(s) like this:
nuc_links$links$biosample_nuccore
[1] "2307876014"
And then:
fetch_test <- entrez_fetch(db = "nucleotide",
id = 2307876014,
rettype = "xml")
This is covered in the section "Finding cross-references" of the rentrez tutorial.