I am trying to access KEGG via bioservices to get certain information about a list of genes. The problem is that I do not know beforehand which organism each gene belongs to; my list can contain many genes that all belong to different organisms, and I do not know how to retrieve the desired information about a gene without specifying its organism.
To give an example:
gene_list = ['YMR293C', 'b3640']
The first gene belongs to yeast, while the second one belongs to E.coli.
If I now try:
from bioservices import KEGG
kegg_con = KEGG()
res = kegg_con.get('b3640', parse=True)['NAME']
I end up with a TypeError, since kegg_con.get('b3640', parse=True) does not return a dictionary but just a number (since I do not specify the organism it belongs to). That works, however, when I specify the organism (here it is eco, which stands for E.coli):
kegg_con.get('eco:b3640', parse=True)['NAME']
returns
[u'dut']
which is correct.
I then tried to get the information about the associated organism by using find. That works fine for YMR293C but fails for b3640:
kegg_con.find('genes', 'YMR293C')
returns
u'sce:YMR293C\tHER2, GEP6, QRS1, RRG6; glutamyl-tRNA(Gln) amidotransferase subunit HER2 (EC:6.3.5.7); K02433 aspartyl-tRNA(Asn)/glutamyl-tRNA(Gln) amidotransferase subunit A [EC:6.3.5.6 6.3.5.7]\ncal:CaO19.11438\tlikely amidase similar to S. cerevisiae YMR293C mitochondrial putative glutamyl-tRNA amidotransferase\ncal:CaO19.3956\tlikely amidase similar to S. cerevisiae YMR293C mitochondrial putative glutamyl-tRNA amidotransferase; K02433 aspartyl-tRNA(Asn)/glutamyl-tRNA(Gln) amidotransferase subunit A [EC:6.3.5.6 6.3.5.7]\n'
from which I can easily extract the required information (in this case: sce:YMR293C). However, when I run
kegg_con.find('genes', 'b3640')
I get
u'cnb:CNBB3640\thypothetical protein; K06316 oligosaccharide translocation protein RFT1\ncgi:CGB_B3640C\thypothetical protein\neco:b3640\tdut; deoxyuridinetriphosphatase (EC:3.6.1.23); K01520 dUTP pyrophosphatase [EC:3.6.1.23]\nsea:SeAg_B3640\tbfd; bacterioferritin-associated ferredoxin; K02192 bacterioferritin-associated ferredoxin\nyps:YPTB3640\tconserved hypothetical protein\nreu:Reut_B3640\tconserved hypothetical protein\nbbr:BB3640\tphage-related exported protein\nmag:amb3640\thypothetical protein\nbcg:BCG9842_B3640\tflagellar hook-associated protein; K02407 flagellar hook-associated protein 2\ncbi:CLJ_B3640\tconserved hypothetical protein; K09963 uncharacterized protein\nmmo:MMOB3640\thypothetical protein\nmbo:Mb3640c\tftsH; membrane-bound protease FTSH (cell division protein) (EC:3.4.24.-); K03798 cell division protease FtsH [EC:3.4.24.-]\n'
which does not provide the information about E.coli.
My questions are therefore:
1) Is there a way so that I can access the information about a gene just based on its gene ID without specifying the organism it belongs to?
2) What would be the best way to retrieve the information to which organism the gene belongs? And why does find fail when I search for the E.coli gene?
The output of the find() method is a plain string that is not easy to read, but I believe the information you are looking for is in the output. On the third line, you can see:
eco:b3640
Now, I am not sure whether the output from KEGG always has the same structure. If so, and assuming the line of interest is the third one, you could use:
res = kegg_con.find('genes', 'b3640')
organism = res.split("\n")[2].split()[0].split(":")[0]
You can further check that it is a valid organism as follows:
assert organism in kegg_con.organismIds
To be on the safe side, you could search for the identifier in the string (rather than taking the third line):
[x for x in res.split() if "b3640" in x]
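Note that with the output quoted in the question, the substring test above also picks up mag:amb3640, because "b3640" occurs inside "amb3640". A minimal sketch of a stricter variant follows. It assumes that every line of the find() output starts with an "organism:gene" entry followed by a tab, as in the output shown in the question. The helper name organisms_for_gene and the shortened sample string are made up for illustration and are not part of bioservices.

```python
def organisms_for_gene(find_output, gene_id):
    """Return the organism codes whose entry ID equals gene_id exactly.

    Assumes each non-empty line looks like "org:gene<TAB>description".
    """
    matches = []
    for line in find_output.splitlines():
        if not line.strip():
            continue  # skip the empty line left by the trailing newline
        entry = line.split("\t", 1)[0]        # e.g. "eco:b3640"
        org, _, gene = entry.partition(":")   # ("eco", ":", "b3640")
        if gene == gene_id:                   # exact, case-sensitive match
            matches.append(org)
    return matches


# Shortened sample of the find() output quoted in the question.
sample = (
    "cnb:CNBB3640\thypothetical protein\n"
    "eco:b3640\tdut; deoxyuridinetriphosphatase (EC:3.6.1.23)\n"
    "mag:amb3640\thypothetical protein\n"
)
print(organisms_for_gene(sample, "b3640"))  # ['eco']
```

With a live connection, the same function could be fed the string returned by kegg_con.find('genes', gene), and each returned code could then be used to build the full identifier (for example 'eco:b3640') for kegg_con.get. The result is a list because the same gene ID could in principle exist in more than one organism.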
Hope it helps.
TC, the main author of bioservices