I'm attempting to use BioPython's REST module from Bio.KEGG to query the KEGG database for the names and formulas of some compounds, using each compound's chemical identification number (CID), e.g. C00001 is water, C00123 is leucine, etc.:
from Bio.KEGG import REST
from Bio.KEGG import Compound
def cpd_decoder(cid):  # gets the compound name and formula from KEGG
    if "C" in cid:
        cid = "cpd:" + cid
        kegg_entry = REST.kegg_get(cid)
        for record in Compound.parse(kegg_entry):
            cid_name = record.name[0]
            cid_formula = record.formula
            return cid_name, cid_formula
cid = "C00123"  # example CID; this one's for leucine
if cpd_decoder(cid) != None:
    compound, formula = cpd_decoder(cid)
However, despite the fact that BioPython is using KEGG's own API, I almost always get the following error:
    if cpd_decoder(cid) !=None:
  File "/media/tessa/Storage/Alien_Earths/Network_expansion/network expansion test 2.py", line 27, in cpd_decoder
    kegg_entry=REST.kegg_get(cid)
  File "/home/tessa/.local/lib/python3.10/site-packages/Bio/KEGG/REST.py", line 208, in kegg_get
    resp = _q("get", dbentries)
  File "/home/tessa/.local/lib/python3.10/site-packages/Bio/KEGG/REST.py", line 44, in _q
    resp = urlopen(URL % (args))
  File "/usr/lib/python3.10/urllib/request.py", line 216, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/lib/python3.10/urllib/request.py", line 525, in open
    response = meth(req, response)
  File "/usr/lib/python3.10/urllib/request.py", line 634, in http_response
    response = self.parent.error(
  File "/usr/lib/python3.10/urllib/request.py", line 563, in error
    return self._call_chain(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 496, in _call_chain
    result = func(*args)
  File "/usr/lib/python3.10/urllib/request.py", line 643, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
I'm wondering whether, because I'm working with a large list of CIDs, KEGG now thinks I'm a bot and is blocking me. Is there a way to work around or resolve this?
I found a fix, and it sped up the code a lot. I haven't used Biopython before, only the requests package in Python, but you can probably do the same thing in Biopython.
Rather than making a connection request for every KO number (or in your case compound ID) separately, you can put all KO numbers in a single request. So instead of requesting:
https://rest.kegg.jp/link/reaction/ko:K00012
https://rest.kegg.jp/link/reaction/ko:K12450
etc.
you can do this:
https://rest.kegg.jp/link/reaction/ko:K00012+K12450+<insert all your queries separated by a "+">
This also runs much faster because you only have to wait for KEGG to respond once. Then you just need to parse the result (Biopython can probably do that already).
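For the compound IDs in the question, a minimal sketch of building such combined URLs might look like the following. It assumes KEGG's documented limit of 10 entries per "get" request, so the IDs are split into batches; `batched_get_urls` and the two constants are hypothetical names used only for illustration, and the sketch only builds the URLs without contacting KEGG.

```python
from typing import Iterator, List

KEGG_GET_URL = "https://rest.kegg.jp/get/"  # KEGG REST "get" endpoint
MAX_ENTRIES_PER_GET = 10  # assumption: KEGG accepts at most 10 entries per "get"


def batched_get_urls(cids: List[str], batch_size: int = MAX_ENTRIES_PER_GET) -> Iterator[str]:
    """Yield one "get" URL per batch of compound IDs (hypothetical helper)."""
    for start in range(0, len(cids), batch_size):
        batch = cids[start:start + batch_size]
        # Prefix each ID with "cpd:" and join the batch with "+"
        yield KEGG_GET_URL + "+".join("cpd:" + cid for cid in batch)


# Small batch size here only to show the splitting
example_cids = ["C00001", "C00123", "C00031"]
urls = list(batched_get_urls(example_cids, batch_size=2))
for url in urls:
    print(url)
```

Each URL can then be fetched with a single request, so a list of N compounds needs roughly N/10 requests instead of N.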
Here's what my code for this looks like:
import requests
# Replace with your own query
KO_numbers = ["K00012", "K12450", "K21379"]
# Define the start of the URL; replace with the URL for your own needs
url = "https://rest.kegg.jp/link/reaction/ko:"
# Append all KO numbers to the URL, separated by "+" (without a trailing "+")
url += "+".join(KO_numbers)
# Do the actual request; raise an error if something goes wrong
response = requests.get(url)
if response.status_code != 200:
    raise ConnectionError("Cannot connect to KEGG API")
# Here I just print the response; from here, parse it to do what you want with the data
print(response.text)
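As a rough sketch of the parsing step, assuming the "link" response is plain text with one tab-separated pair per line (source entry, then linked entry), the text could be grouped into a dictionary as below. `parse_link_response` is a hypothetical helper, and the sample text is a made-up placeholder in that assumed shape, not real KEGG output.

```python
from collections import defaultdict
from typing import Dict, List


def parse_link_response(text: str) -> Dict[str, List[str]]:
    """Group linked entries by source entry (assumes two tab-separated columns)."""
    links: Dict[str, List[str]] = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        source, target = line.split("\t")[:2]
        links[source].append(target)
    return dict(links)


# Placeholder text in the assumed format; the reaction IDs are invented
sample_text = "ko:K00012\trn:R00001\nko:K00012\trn:R00002\nko:K12450\trn:R00003\n"
links = parse_link_response(sample_text)
print(links)
```

In the real script you would pass `response.text` instead of `sample_text`.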