Retrieving Gene Ontology Hierarchy Using Python


I'm trying to parse Gene Ontology (GO) terms from an OBO file and display them as a hierarchy using Python. I have made some progress, but my code does not handle terms that have more than one is_a relationship. My goal is a hierarchical structure that takes every is_a relationship into account.

I'm working with a subset of the Gene Ontology data from the go-basic.obo file. Here's an example of the data format:

format-version: 1.2
data-version: releases/2023-06-11
subsetdef: chebi_ph7_3 "Rhea list of ChEBI terms representing the major species at pH 7.3."
subsetdef: gocheck_do_not_annotate "Term not to be used for direct annotation"
subsetdef: gocheck_do_not_manually_annotate "Term not to be used for direct manual annotation"
subsetdef: goslim_agr "AGR slim"
subsetdef: goslim_aspergillus "Aspergillus GO slim"
subsetdef: goslim_candida "Candida GO slim"
subsetdef: goslim_chembl "ChEMBL protein targets summary"
subsetdef: goslim_drosophila "Drosophila GO slim"
subsetdef: goslim_flybase_ribbon "FlyBase Drosophila GO ribbon slim"
subsetdef: goslim_generic "Generic GO slim"
subsetdef: goslim_metagenomics "Metagenomics GO slim"
subsetdef: goslim_mouse "Mouse GO slim"
subsetdef: goslim_pir "PIR GO slim"
subsetdef: goslim_plant "Plant GO slim"
subsetdef: goslim_pombe "Fission yeast GO slim"
subsetdef: goslim_synapse "synapse GO slim"
subsetdef: goslim_yeast "Yeast GO slim"
subsetdef: prokaryote_subset "GO subset for prokaryotes"
synonymtypedef: syngo_official_label "label approved by the SynGO project"
synonymtypedef: systematic_synonym "Systematic synonym" EXACT
default-namespace: gene_ontology
ontology: go

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
def: "The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton." [GOC:mcc, PMID:10873824, PMID:11389764]
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0048308
name: organelle inheritance
namespace: biological_process
def: "The partitioning of organelles between daughter cells at cell division." [GOC:jid]
subset: goslim_pir
subset: goslim_yeast
is_a: GO:0006996 ! organelle organization

[Term]
id: GO:0007029
name: endoplasmic reticulum organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the endoplasmic reticulum." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "endoplasmic reticulum morphology" RELATED []
synonym: "endoplasmic reticulum organisation" EXACT []
synonym: "endoplasmic reticulum organization and biogenesis" RELATED [GOC:mah]
synonym: "ER organisation" EXACT []
synonym: "ER organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization

[Term]
id: GO:0048309
name: endoplasmic reticulum inheritance
namespace: biological_process
def: "The partitioning of endoplasmic reticulum between daughter cells at cell division." [GOC:jid]
synonym: "ER inheritance" EXACT []
is_a: GO:0007029 ! endoplasmic reticulum organization
is_a: GO:0048308 ! organelle inheritance

[Term]
id: GO:0048313
name: Golgi inheritance
namespace: biological_process
def: "The partitioning of Golgi apparatus between daughter cells at cell division." [GOC:jid, PMID:12851069]
synonym: "Golgi apparatus inheritance" EXACT []
synonym: "Golgi division" EXACT [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi partitioning" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:0048308 ! organelle inheritance

[Term]
id: GO:0007030
name: Golgi organization
namespace: biological_process
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of the Golgi apparatus." [GOC:dph, GOC:jl, GOC:mah]
subset: goslim_pir
synonym: "Golgi apparatus organization" EXACT []
synonym: "Golgi organisation" EXACT []
synonym: "Golgi organization and biogenesis" RELATED [GOC:mah]
is_a: GO:0006996 ! organelle organization
relationship: part_of GO:0010256 ! endomembrane system organization

[Term]
id: GO:0090166
name: Golgi disassembly
namespace: biological_process
def: "A cellular process that results in the breakdown of a Golgi apparatus that contributes to Golgi inheritance." [GOC:ascb_2009, GOC:dph, GOC:tb]
synonym: "Golgi apparatus disassembly" EXACT []
is_a: GO:0007030 ! Golgi organization
is_a: GO:1903008 ! organelle disassembly
relationship: part_of GO:0048313 ! Golgi inheritance

[Term]
id: GO:1903008
name: organelle disassembly
namespace: biological_process
def: "The disaggregation of an organelle into its constituent components." [GO_REF:0000079, GOC:TermGenie]
synonym: "organelle degradation" EXACT []
is_a: GO:0006996 ! organelle organization
is_a: GO:0022411 ! cellular component disassembly


[Term]
id: GO:0006996
name: organelle organization
namespace: biological_process
alt_id: GO:1902589
def: "A process that is carried out at the cellular level which results in the assembly, arrangement of constituent parts, or disassembly of an organelle within a cell. An organelle is an organized structure of distinctive morphology and function. Includes the nucleus, mitochondria, plastids, vacuoles, vesicles, ribosomes and the cytoskeleton. Excludes the plasma membrane." [GOC:mah]
subset: goslim_candida
subset: goslim_pir
synonym: "organelle organisation" EXACT []
synonym: "organelle organization and biogenesis" RELATED [GOC:dph, GOC:jl, GOC:mah]
synonym: "single organism organelle organization" EXACT [GOC:TermGenie]
synonym: "single-organism organelle organization" RELATED []
is_a: GO:0016043 ! cellular component organization

This is the code I used:

def parse_obo(file_path):
    terms = {}
    current_term = None
    
    with open(file_path, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                if current_term:
                    terms[current_term['id']] = current_term
                    current_term = None
            elif line.startswith('[Term]'):
                if current_term:
                    terms[current_term['id']] = current_term
                current_term = {'id': ''}
            elif current_term:
                parts = line.split(': ', 1)
                if len(parts) == 2:
                    current_term[parts[0]] = parts[1]
    
    return terms

def display_hierarchy(terms, term_id, indent=0):
    if term_id in terms:
        term = terms[term_id]
        print(' ' * indent + term_id)
        
        if 'is_a' in term:
            parent_ids = [parent.split()[1] for parent in term['is_a'] if len(parent.split()) > 1]
            for parent_id in parent_ids:
                display_hierarchy(terms, parent_id, indent + 4)

        if 'id' in term:
            child_ids = [child_id for child_id in terms if term_id in terms[child_id].get('is_a', [])]
            for child_id in child_ids:
                display_hierarchy(terms, child_id, indent + 4)

if __name__ == "__main__":
    file_path = 'go-basic_1.obo'
    terms = parse_obo(file_path)
    
    for term_id in terms:
        display_hierarchy(terms, term_id, indent=0)

This is the output I get:

GO:0000001
GO:0048308
    GO:0048309
    GO:0048313
GO:0007029
GO:0048309
GO:0048313
GO:0007030
GO:0090166
GO:1903008
    GO:0090166
GO:0006996
    GO:0048308
        GO:0048309
        GO:0048313
    GO:0007029
    GO:0007030

But I want a result like this:

GO:0016043
    GO:0006996
        GO:1903008
            GO:0090166
        GO:0048308
            GO:0000001
            GO:0048309
            GO:0048313
        GO:0007029
            GO:0048309
        GO:0007030
            GO:0048313
            GO:0090166
GO:0048311
    GO:0000001
GO:0022411
    GO:1903008
    GO:0090166

I want to plot Gene Ontology results for my genomic data, so this is where I started. Any help is appreciated.


Solution

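  • You would need to take care of these points:

      • A term can have several is_a lines, but current_term[parts[0]] = parts[1] keeps only the last one. Collect the values of a repeated key in a list instead.
      • Because is_a ends up as a single string, for parent in term['is_a'] loops over its characters, so parent_ids is always empty, and term_id in terms[child_id].get('is_a', []) is a substring test against that one remaining is_a line.
      • An is_a value looks like "GO:0048308 ! organelle inheritance". The parent ID is the first token, so use split()[0], not split()[1].
      • Some parents, such as GO:0016043, have no [Term] stanza of their own in the subset. They still need an entry so they can act as roots.
      • Build the list of children for each term once, then print top-down from the roots (the terms without is_a), instead of starting a traversal from every term.

    Here is how that would look: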

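    def parse_obo(file_path):
        """Return {term_id: {'id': term_id, key: [value, ...]}} for each [Term] stanza."""
        terms = {}
        current_term = {}
        in_term = False
        with open(file_path, 'r') as f:
            for line in f:
                line = line.strip()
                if line.startswith('['):
                    # Stanza header: only [Term] stanzas are read; the file
                    # header and other stanzas such as [Typedef] are skipped.
                    in_term = line == '[Term]'
                elif in_term and ": " in line:
                    key, value = line.split(': ', 1)
                    if key == "id":
                        current_term = terms.setdefault(value, {})
                        current_term["id"] = value
                    else:
                        # Keys such as is_a can repeat, so collect the values in a list.
                        current_term.setdefault(key, []).append(value)
        return terms

    def make_hierarchy(terms):
        """Attach a 'children' list to every term and return the root terms."""
        # Iterate over a copy, because missing parents are added to terms below.
        for term in list(terms.values()):
            term.setdefault("children", [])
            if "is_a" in term:
                for is_a in term["is_a"]:
                    # The value is "GO:xxxxxxx ! name"; the first token is the parent ID.
                    parent = is_a.split()[0]
                    # Create a placeholder when the parent has no stanza in the file.
                    terms.setdefault(parent, { 'id': parent }).setdefault("children", []).append(term)
        # Roots are the terms that have no is_a parent.
        return [term for term in terms.values() if "is_a" not in term]

    def display_hierarchy(terms, indent=""):
        """Print each term ID, with its children indented one level deeper."""
        for term in terms:
            print(f"{indent}{term['id']}")
            display_hierarchy(term['children'], indent + "    ")

    if __name__ == "__main__":
        file_path = 'go-basic_1.obo'
        terms = parse_obo(file_path)
        roots = make_hierarchy(terms)
        display_hierarchy(roots)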
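
To try the idea without the OBO file, here is a minimal sketch that runs on a few in-memory lines cut down from the question's data. It is not the code above: the names `SAMPLE`, `parse_parents` and `render` are made up for this illustration, it keeps only `id` and `is_a` lines, and it stores plain ID lists instead of nested term dictionaries. It shows that a term with two `is_a` lines ends up under both parents.

```python
from collections import defaultdict

# Hypothetical cut-down sample: three stanzas from the question,
# reduced to their id and is_a lines.
SAMPLE = """\
[Term]
id: GO:0000001
is_a: GO:0048308 ! organelle inheritance
is_a: GO:0048311 ! mitochondrion distribution

[Term]
id: GO:0048308
is_a: GO:0006996 ! organelle organization

[Term]
id: GO:0006996
is_a: GO:0016043 ! cellular component organization
"""


def parse_parents(lines):
    """Map each term ID to the list of all its is_a parent IDs."""
    parents = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith("id: "):
            current = line[len("id: "):]
            parents[current] = []
        elif line.startswith("is_a: ") and current is not None:
            # Keep only the ID; drop the "! name" comment that follows it.
            parents[current].append(line[len("is_a: "):].split()[0])
    return parents


def render(parents, indent="    "):
    """Return the top-down hierarchy as a list of indented lines."""
    # Invert the child -> parents map into parent -> children.
    children = defaultdict(list)
    for term, term_parents in parents.items():
        for parent in term_parents:
            children[parent].append(term)

    # Parents without a stanza of their own still count as terms.
    all_ids = list(parents) + [p for p in children if p not in parents]
    roots = [t for t in all_ids if not parents.get(t)]

    out = []

    def walk(term, depth):
        out.append(indent * depth + term)
        for child in children.get(term, []):
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return out


if __name__ == "__main__":
    print("\n".join(render(parse_parents(SAMPLE.splitlines()))))
```

Since GO is a graph rather than a tree, a term with several parents is printed once under each of them; here `GO:0000001` appears under both `GO:0048311` and `GO:0048308`.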