I am trying to use Biopython (Entrez) with search terms that will return the accession number (and not the GI*).
Here is a tiny excerpt of my code:
from Bio import Entrez
Entrez.email = 'myemailaddress'
search_phrase = 'Escherichia coli[organism]) AND (complete genome[keyword])'
handle = Entrez.esearch(db='nuccore', term=search_phrase, retmax=100, rettype='acc', retmode='text')
result = Entrez.read(handle)
handle.close()
gi_numbers = result['IdList']
print(gi_numbers)
'745369752', '910228862', '187736741', '802098270', '802098269', '802098267', '387610477', '544579032', '544574430', '215485161', '749295052', '387823261', '387605479', '641687520', '641682562', '594009615', '557270520', '313848522', '309700213', '284919779', '215263233', '544345556', '544340954', '144661', '51773702', '202957457', '202957451', '172051323'
I am sure I can convert from GI to accession, but it would be nice to avoid the additional step. What slice of magic am I missing?
Thank you in advance.
*especially since NCBI is phasing out GI numbers
Looking through the docs for esearch
on NCBI's website, there are only two rettype
s available - uilist
, which is the default XML format that you're getting currently (it's parsed into a dict by Entrez.read()
), and count
, which just displays the Count
value (look at the complete contents of result
, it's there), which I'm unclear on its exact meaning, as it doesn't represent the total number of items in IdList
...
At any rate, Entrez.esearch()
will take any value of rettype
and retmode
you like, but it only returns the uilist
or count
in xml
or json
mode - no accession IDs, no nothin'.
Entrez.efetch()
will pass you back all sorts of cool stuff, depending on which DB you're querying. The downside, of course, is that you need to query by one or more IDs, not by a search string, so in order to get your accession IDs you'd need to run two queries:
search_phrase = "Escherichia coli[organism]) AND (complete genome[keyword])"
handle = Entrez.esearch(db="nuccore", term=search_phrase, retmax=100)
result = Entrez.read(handle)
handle.close()
fetch_handle = Entrez.efetch(db="nuccore", id=results["IdList"], rettype="acc", retmode="text")
acc_ids = [id.strip() for id in fetch_handle]
fetch_handle.close()
print(acc_ids)
gives
['HF572917.2', 'NZ_HF572917.1', 'NC_010558.1', 'NZ_HG941720.1', 'NZ_HG941719.1', 'NZ_HG941718.1', 'NC_017633.1', 'NC_022371.1', 'NC_022370.1', 'NC_011601.1', 'NZ_HG738867.1', 'NC_012892.2', 'NC_017626.1', 'HG941719.1', 'HG941718.1', 'HG941720.1', 'HG738867.1', 'AM946981.2', 'FN649414.1', 'FN554766.1', 'FM180568.1', 'HG428756.1', 'HG428755.1', 'M37402.1', 'AJ304858.2', 'FM206294.1', 'FM206293.1', 'AM886293.1']
So, I'm not terribly sure if I answered your question satisfactorily, but unfortunately I think the answer is "There is no magic."