I'm trying to parse Gene Ontology (GO) terms from an OBO file and display them hierarchically using Python. I have made progress, but I'm having trouble handling multiple is_a relationships within the same term. My goal is a hierarchical structure that takes all is_a relationships into account.
I'm working with a subset of the Gene Ontology data from the go-basic.obo file. Here's an example of the data format:
format-version: 1.2
data-version: releases/2023-06-11
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: prokaryote_subset "GO subset for prokaryotes"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
ontology: go
[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution
[Term]
id: GO:0048308
name: organelle inheritance
namespace: biological_process
def: "The partitioning of organelles between daughter cells at cell division." [GOC:jid]
subset: goslim_pir
subset: goslim_yeast
is_a: GO:0006996 ! organelle organization
[Term]
id: GO:0007029
name: endoplasmic reticulum organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the endoplasmic reticulum." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "endoplasmic reticulum morphology" RELATED []
synonym: "endoplasmic reticulum organisation" EXACT []
synonym: "endoplasmic reticulum organization and biogenesis" RELATED [GOC:mah]
synonym: "ER organisation" EXACT []
synonym: "ER organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization
[Term]
id: GO:0048309
name: endoplasmic reticulum inheritance
namespace: biological_process
def: "The partitioning of endoplasmic reticulum between daughter cells at cell division." [GOC:jid]
synonym: "ER inheritance" EXACT []
is_a: GO:0007029 ! endoplasmic reticulum organization
is_a: GO:0048308 ! organelle inheritance
[Term]
id: GO:0048313
name: Golgi inheritance
namespace: biological_process
def: "The partitioning of Golgi apparatus between daughter cells at cell division." [GOC:jid, PMID:12851069]
synonym: "Golgi apparatus inheritance" EXACT []
synonym: "Golgi division" EXACT [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi partitioning" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:0048308 ! organelle inheritance
[Term]
id: GO:0007030
name: Golgi organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the Golgi apparatus." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "Golgi apparatus organization" EXACT []
synonym: "Golgi organisation" EXACT []
synonym: "Golgi organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization
[Term]
id: GO:0090166
name: Golgi disassembly
namespace: biological_process
def: "A cellular process that results in the breakdown of a Golgi apparatus that contributes to Golgi inheritance." [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi apparatus disassembly" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:1903008 ! organelle disassembly
relationship: part_of GO:0048313 ! Golgi inheritance
[Term]
id: GO:1903008
name: organelle disassembly
namespace: biological_process
def: "The disaggregation of an organelle into its constituent components." [GO_REF:0000079, GOC:TermGenie]
synonym: "organelle degradation" EXACT []
is_a: GO:0006996 ! organelle organization
is_a: GO:0022411 ! cellular component disassembly
[Term]
id: GO:0006996
name: organelle organization
namespace: biological_process
alt_id: GO:1902589
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of an organelle within a cell. An organelle is an organized structure of distinctive morphology and function. Includes the nucleus, mitochondria, plastids, vacuoles, vesicles, ribosomes and the cytoskeleton. Excludes the plasma membrane." [GOC:mah]
subset: goslim_candida
subset: goslim_pir
synonym: "organelle organisation" EXACT []
synonym: "organelle organization and biogenesis" RELATED [GOC:dph, GOC:jl, GOC:mah]
synonym: "single organism organelle organization" EXACT [GOC:TermGenie]
synonym: "single-organism organelle organization" RELATED []
is_a: GO:0016043 ! cellular component organization
This is the code I used:
def parse_obo(file_path):
    terms = {}
    current_term = None

    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                if current_term:
                    terms[current_term['id']] = current_term
                    current_term = None
            elif line.startswith('[Term]'):
                if current_term:
                    terms[current_term['id']] = current_term
                current_term = {'id': ''}
            elif current_term:
                parts = line.split(': ', 1)
                if len(parts) == 2:
                    current_term[parts[0]] = parts[1]
    return terms

def display_hierarchy(terms, term_id, indent=0):
    if term_id in terms:
        term = terms[term_id]
        print(' ' * indent + term_id)
        if 'is_a' in term:
            parent_ids = [parent.split()[1] for parent in term['is_a'] if len(parent.split()) > 1]
            for parent_id in parent_ids:
                display_hierarchy(terms, parent_id, indent + 4)
        if 'id' in term:
            child_ids = [child_id for child_id in terms if term_id in terms[child_id].get('is_a', [])]
            for child_id in child_ids:
                display_hierarchy(terms, child_id, indent + 4)

if __name__ == "__main__":
    file_path = 'go-basic_1.obo'
    terms = parse_obo(file_path)
    for term_id in terms:
        display_hierarchy(terms, term_id, indent=0)
This is the output I get:
GO:0000001
GO:0048308
    GO:0048309
    GO:0048313
GO:0007029
GO:0048309
GO:0048313
GO:0007030
GO:0090166
GO:1903008
    GO:0090166
GO:0006996
    GO:0048308
        GO:0048309
        GO:0048313
    GO:0007029
    GO:0007030
but I want a result like this:
GO:0016043
    GO:0006996
        GO:1903008
            GO:0090166
        GO:0048308
            GO:0000001
            GO:0048309
            GO:0048313
        GO:0007029
            GO:0048309
        GO:0007030
            GO:0048313
            GO:0090166
GO:0048311
    GO:0000001
GO:0022411
    GO:1903008
        GO:0090166
I want to plot the Gene Ontology results from my genomic data, so I started from here. Any help would be appreciated.
You would need to take care of these points:
As is_a may occur multiple times per term, you need to collect its values in a list; otherwise each occurrence overwrites the previous one and only the last value encountered per term is retained. I would generalise this and give every key in a term a list value, except maybe for id, which should occur only once per term.
To display the hierarchy, you would benefit from having the relation from parent to children instead of from child to parents. So I would suggest adding a separate function that adds this inverse relationship to the terms.
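The first point can be seen in isolation with a minimal sketch. It uses the two is_a lines of GO:0000001 from the question; the variable names (`overwritten`, `collected`, `parent_ids`) are only illustrative.

```python
# The two is_a lines of the GO:0000001 stanza shown in the question.
lines = [
    "is_a: GO:0048308 ! organelle inheritance",
    "is_a: GO:0048311 ! mitochondrion distribution",
]

# Plain assignment: the second is_a replaces the first.
overwritten = {}
for line in lines:
    key, value = line.split(": ", 1)
    overwritten[key] = value

# setdefault + append: every is_a value is kept, in file order.
collected = {}
for line in lines:
    key, value = line.split(": ", 1)
    collected.setdefault(key, []).append(value)

# The parent ID is the first whitespace-separated token of each value.
parent_ids = [value.split()[0] for value in collected["is_a"]]

print(overwritten["is_a"])  # only the last parent survives
print(parent_ids)           # both parents are available
```

With plain assignment only GO:0048311 remains, which is why GO:0000001 never shows up under GO:0048308 in the question's output.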
Here is how that would look:
def parse_obo(file_path):
    terms = {}
    current_term = {}
    in_term = False

    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if line.startswith('['):
                # Stanza header: parse only [Term] stanzas, so the file
                # header and any [Typedef] stanzas are skipped.
                in_term = line == '[Term]'
            elif in_term and ": " in line:
                key, value = line.split(': ', 1)
                if key == "id":
                    current_term = terms.setdefault(value, {})
                    current_term["id"] = value
                else:
                    # Other keys (is_a, synonym, ...) may repeat: keep all values.
                    current_term.setdefault(key, []).append(value)
    return terms

def make_hierarchy(terms):
    # Iterate over a snapshot, as parents missing from the file are added below.
    for term in list(terms.values()):
        term.setdefault("children", [])
        if "is_a" in term:
            for is_a in term["is_a"]:
                # A value looks like "GO:0048308 ! organelle inheritance".
                parent = is_a.split()[0]
                terms.setdefault(parent, { 'id': parent }).setdefault("children", []).append(term)
    # The roots are the terms that have no is_a parent.
    return [term for term in terms.values() if "is_a" not in term]

def display_hierarchy(terms, indent=""):
    for term in terms:
        print(f"{indent}{term['id']}")
        # A term with several parents is printed once under each of them.
        display_hierarchy(term['children'], indent + "    ")

if __name__ == "__main__":
    file_path = 'go-basic_1.obo'
    terms = parse_obo(file_path)
    roots = make_hierarchy(terms)
    display_hierarchy(roots)
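To try the idea without the file, here is a self-contained sketch that works on three stanzas cut down from the question's data. The names `SAMPLE`, `parse_obo_text`, `children_by_parent` and `render` are hypothetical and exist only for this example. As a simplifying assumption, it stores child IDs in a separate dictionary instead of nesting term dictionaries, and it returns the lines rather than printing them directly.

```python
SAMPLE = """\
[Term]
id: GO:0048308
name: organelle inheritance
is_a: GO:0006996 ! organelle organization

[Term]
id: GO:0007029
name: endoplasmic reticulum organization
is_a: GO:0006996 ! organelle organization

[Term]
id: GO:0048309
name: endoplasmic reticulum inheritance
is_a: GO:0007029 ! endoplasmic reticulum organization
is_a: GO:0048308 ! organelle inheritance
"""


def parse_obo_text(text):
    """Return {term_id: [parent_id, ...]} for every [Term] stanza."""
    parents = {}
    current = None
    in_term = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            in_term = line == "[Term]"
            current = None
        elif in_term and line.startswith("id: "):
            current = line[len("id: "):]
            parents.setdefault(current, [])
        elif in_term and current and line.startswith("is_a: "):
            # Keep only the ID, dropping the "! name" comment.
            parents[current].append(line[len("is_a: "):].split()[0])
    return parents


def children_by_parent(parents):
    """Invert child -> parents into parent -> children."""
    children = {}
    for child, parent_ids in parents.items():
        children.setdefault(child, [])
        for parent in parent_ids:
            children.setdefault(parent, []).append(child)
    return children


def render(children, term_ids, indent=""):
    """Return the indented lines for term_ids and all their descendants."""
    lines = []
    for term_id in term_ids:
        lines.append(indent + term_id)
        lines.extend(render(children, children[term_id], indent + "    "))
    return lines


parents = parse_obo_text(SAMPLE)
children = children_by_parent(parents)
# Roots have no recorded parent; this includes IDs that only occur as a parent.
roots = [term_id for term_id in children if not parents.get(term_id)]
output = render(children, roots)
print("\n".join(output))
```

On this sample, GO:0048309 is listed twice, once under GO:0048308 and once under GO:0007029, which is the behaviour asked for in the question. The recursion relies on the is_a graph being acyclic.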