Tags: python, xml, bioinformatics, biopython, ncbi

How to retrieve NCBI Entrez summary using gene name with Biopython?


I've explored a variety of options and solutions online, but I can't quite figure this out. I'm new to Entrez, so I don't fully understand how it works, but my attempt is below.

My goal would be to print out the online summary, so for instance for Kat2a I'd want it to print out 'Enables H3 histone acetyltransferase activity; chromatin binding activity; and histone acetyltransferase activity (H4-K12 specific). Involved in several processes' ...etc, from the summary section on NCBI.

from Bio import Entrez

def get_summary(gene_name):
    Entrez.email = 'x'

    query = f'{gene_name}[Gene Name]'
    handle = Entrez.esearch(db='gene', term=query)
    record = Entrez.read(handle)
    handle.close()

    NCBI_ids = record['IdList']
    for id in NCBI_ids:
        handle = Entrez.esummary(db='gene', id=id)
        record = Entrez.read(handle)
        print(record['Summary'])
    return 0

Solution

  • Using Biopython to fetch all gene IDs associated with a provided gene name¹ and gathering all gene summaries per ID²

    You were on the right track. Here is one example that fleshes out the approach you started in your question (more customization is of course possible). By default, Entrez.esearch returns at most 20 Gene IDs; the function below raises that limit to 100 through its max_gene_ids parameter. It also filters the query by organism, which defaults to 'human'; pass organism=None to search across all organisms.

    import time
    import xmltodict
    
    from collections import defaultdict
    
    from Bio import Entrez
    
    
    def get_entrez_gene_summary(
        gene_name, email, organism="human", max_gene_ids=100
    ):
        """Returns the 'Summary' contents for provided input
        gene from the Entrez Gene database. All gene IDs 
        returned for input gene_name will have their docsum
        summaries 'fetched'.
        
        Args:
            gene_name (string): Official (HGNC) gene name 
               (e.g., 'KAT2A')
            email (string): Required email for making requests
            organism (string, optional): Defaults to "human".
               Restricts results to the given organism. Set to None
               to return results for all organisms.
            max_gene_ids (int, optional): Sets the number of Gene
               ID results to return (absolute max allowed is 10K).
            
        Returns:
            dict: Summaries for all gene IDs associated with 
               gene_name (where: keys → [orgn][gene name],
                          values → gene summary)
        """
        Entrez.email = email
    
        query = (
            f"{gene_name}[Gene Name]"
            if not organism
            else f"({gene_name}[Gene Name]) AND {organism}[Organism]"
        )
        handle = Entrez.esearch(db="gene", term=query, retmax=max_gene_ids)
        record = Entrez.read(handle)
        handle.close()
    
        gene_summaries = defaultdict(dict)
        gene_ids = record["IdList"]
    
        print(
            f"{len(gene_ids)} gene IDs returned associated with gene {gene_name}."
        )
        for gene_id in gene_ids:
            print(f"\tRetrieving summary for {gene_id}...")
            handle = Entrez.efetch(db="gene", id=gene_id, rettype="docsum")
            gene_dict = xmltodict.parse(
                "".join([x.decode(encoding="utf-8") for x in handle.readlines()]),
                dict_constructor=dict,
            )
            gene_docsum = gene_dict["eSummaryResult"]["DocumentSummarySet"][
                "DocumentSummary"
            ]
            name = gene_docsum.get("Name")
            summary = gene_docsum.get("Summary")
            gene_organism = gene_docsum.get("Organism")["CommonName"]
            gene_summaries[gene_organism][name] = summary
            handle.close()
            time.sleep(0.34)  # Requests to NCBI are rate limited to 3 per second
    
        return gene_summaries
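
As an alternative sketch that stays closer to the code in the question, Entrez.esummary can be combined with Entrez.read, which avoids the xmltodict dependency. This assumes that, for db="gene", the parsed result holds the document summaries under record["DocumentSummarySet"]["DocumentSummary"]; that is the layout I would expect, but it is worth checking against your Biopython version. The names extract_summary and fetch_docsums are made up for this illustration.

```python
def extract_summary(docsum):
    """Pull organism, gene name and summary text out of one parsed docsum.

    `docsum` is any mapping shaped like a Gene DocumentSummary
    (keys 'Name', 'Summary' and 'Organism' -> 'CommonName').
    """
    organism = (docsum.get("Organism") or {}).get("CommonName")
    # An empty Summary string is normalised to None.
    return organism, docsum.get("Name"), docsum.get("Summary") or None


def fetch_docsums(gene_ids, email):
    """Fetch docsums for several Gene IDs in a single esummary request."""
    # Imported here so that extract_summary works without Biopython installed.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.esummary(db="gene", id=",".join(gene_ids))
    record = Entrez.read(handle)
    handle.close()
    # Assumed layout of the parsed result for db="gene".
    return record["DocumentSummarySet"]["DocumentSummary"]
```

Calling fetch_docsums(["217", "216"], email) and passing each returned item to extract_summary should yield one (organism, name, summary) tuple per Gene ID, using a single request instead of one request per ID.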
    
    

    Example 1 – Fetching the gene summary for KAT2A

    >>> email = "your.name@example.com"  # replace with your own email address
    >>> gene_summaries = get_entrez_gene_summary("KAT2A", email)
    

    returns just one gene summary (remember the default is organism='human'). Printed with the same loop used in Example 2 below, it reads:

    1. KAT2A
    KAT2A, or GCN5, is a histone acetyltransferase (HAT) that functions primarily as a transcriptional activator. It also functions as a repressor of NF-kappa-B (see MIM 164011) by promoting ubiquitination of the NF-kappa-B subunit RELA (MIM 164014) in a HAT-independent manner (Mao et al., 2009 [PubMed 19339690]).[supplied by OMIM, Sep 2009]
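
For reference, the search term that the function builds can be reproduced without contacting NCBI. The sketch below mirrors the query logic of get_entrez_gene_summary in a separate helper; build_query is a hypothetical name used only for illustration.

```python
def build_query(gene_name, organism="human"):
    """Return the Entrez search term used for a gene name lookup."""
    if not organism:
        # No organism filter: search the gene name across all organisms.
        return f"{gene_name}[Gene Name]"
    return f"({gene_name}[Gene Name]) AND {organism}[Organism]"


print(build_query("KAT2A"))                 # (KAT2A[Gene Name]) AND human[Organism]
print(build_query("ALDH*", organism=None))  # ALDH*[Gene Name]
```

Printing the term like this is a quick way to check a query in the NCBI Gene search box before running it through Biopython.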
    

    Example 2 – Using wildcards and receiving many genes for a single organism

    For example, gene summaries for all human aldehyde dehydrogenase genes can be obtained using the query ALDH* (the asterisk representing a wildcard):

    >>> email = "your.name@example.com"  # replace with your own email address
    >>> gene_summaries = get_entrez_gene_summary("ALDH*", email, max_gene_ids=50)
    28 gene IDs returned associated with gene ALDH*.
        Retrieving summary for 217...
        Retrieving summary for 216...
        Retrieving summary for 501...
        Retrieving summary for 220...
        Retrieving summary for 224...
        Retrieving summary for 7915...
        Retrieving summary for 218...
        Retrieving summary for 5832...
        Retrieving summary for 219...
        Retrieving summary for 10840...
        Retrieving summary for 8854...
        Retrieving summary for 8540...
        Retrieving summary for 223...
        Retrieving summary for 8659...
        Retrieving summary for 4329...
        Retrieving summary for 221...
        Retrieving summary for 222...
        Retrieving summary for 126133...
        Retrieving summary for 160428...
        Retrieving summary for 64577...
        Retrieving summary for 541...
        Retrieving summary for 100862662...
        Retrieving summary for 544...
        Retrieving summary for 543...
        Retrieving summary for 542...
        Retrieving summary for 101927751...
        Retrieving summary for 283665...
        Retrieving summary for 100874204...
    >>> for i, (k, v) in enumerate(gene_summaries["human"].items()):
    ...    print(f"{i+1}. {k}")
    ...    print(v, end="\n\n")
    
    1. ALDH2
    This protein belongs to the aldehyde dehydrogenase family of proteins. Aldehyde dehydrogenase is the second enzyme of the major oxidative pathway of alcohol metabolism. Two major liver isoforms of aldehyde dehydrogenase, cytosolic and mitochondrial, can be distinguished by their electrophoretic mobilities, kinetic properties, and subcellular localizations. Most Caucasians have two major isozymes, while approximately 50% of East Asians have the cytosolic isozyme but not the mitochondrial isozyme. A remarkably higher frequency of acute alcohol intoxication among East Asians than among Caucasians could be related to the absence of a catalytically active form of the mitochondrial isozyme. The increased exposure to acetaldehyde in individuals with the catalytically inactive form may also confer greater susceptibility to many types of cancer. This gene encodes a mitochondrial isoform, which has a low Km for acetaldehydes, and is localized in mitochondrial matrix. Alternative splicing results in multiple transcript variants encoding distinct isoforms.[provided by RefSeq, Nov 2016]
    
    2. ALDH1A1
    The protein encoded by this gene belongs to the aldehyde dehydrogenase family. Aldehyde dehydrogenase is the next enzyme after alcohol dehydrogenase in the major pathway of alcohol metabolism. There are two major aldehyde dehydrogenase isozymes in the liver, cytosolic and mitochondrial, which are encoded by distinct genes, and can be distinguished by their electrophoretic mobility, kinetic properties, and subcellular localization. This gene encodes the cytosolic isozyme. Studies in mice show that through its role in retinol metabolism, this gene may also be involved in the regulation of the metabolic responses to high-fat diet. [provided by RefSeq, Mar 2011]
    
    3. ALDH7A1
    The protein encoded by this gene is a member of subfamily 7 in the aldehyde dehydrogenase gene family. These enzymes are thought to play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. This particular member has homology to a previously described protein from the green garden pea, the 26g pea turgor protein. It is also involved in lysine catabolism that is known to occur in the mitochondrial matrix. Recent reports show that this protein is found both in the cytosol and the mitochondria, and the two forms likely arise from the use of alternative translation initiation sites. An additional variant encoding a different isoform has also been found for this gene. Mutations in this gene are associated with pyridoxine-dependent epilepsy. Several related pseudogenes have also been identified. [provided by RefSeq, Jan 2011]
    
    4. ALDH1A3
    This gene encodes an aldehyde dehydrogenase enzyme that uses retinal as a substrate. Mutations in this gene have been associated with microphthalmia, isolated 8, and expression changes have also been detected in tumor cells. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jun 2014]
    
    5. ALDH3A2
    Aldehyde dehydrogenase isozymes are thought to play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. This gene product catalyzes the oxidation of long-chain aliphatic aldehydes to fatty acid. Mutations in the gene cause Sjogren-Larsson syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Jul 2008]
    
    6. ALDH5A1
    This protein belongs to the aldehyde dehydrogenase family of proteins. This gene encodes a mitochondrial NAD(+)-dependent succinic semialdehyde dehydrogenase. A deficiency of this enzyme, known as 4-hydroxybutyricaciduria, is a rare inborn error in the metabolism of the neurotransmitter 4-aminobutyric acid (GABA). In response to the defect, physiologic fluids from patients accumulate GHB, a compound with numerous neuromodulatory properties. Two transcript variants encoding distinct isoforms have been identified for this gene. [provided by RefSeq, Jul 2008]
    
    7. ALDH3A1
    Aldehyde dehydrogenases oxidize various aldehydes to the corresponding acids. They are involved in the detoxification of alcohol-derived acetaldehyde and in the metabolism of corticosteroids, biogenic amines, neurotransmitters, and lipid peroxidation. The enzyme encoded by this gene forms a cytoplasmic homodimer that preferentially oxidizes aromatic and medium-chain (6 carbons or more) saturated and unsaturated aldehyde substrates. It is thought to promote resistance to UV and 4-hydroxy-2-nonenal-induced oxidative damage in the cornea. The gene is located within the Smith-Magenis syndrome region on chromosome 17. Multiple alternatively spliced variants, encoding the same protein, have been identified. [provided by RefSeq, Sep 2008]
    
    8. ALDH18A1
    This gene is a member of the aldehyde dehydrogenase family and encodes a bifunctional ATP- and NADPH-dependent mitochondrial enzyme with both gamma-glutamyl kinase and gamma-glutamyl phosphate reductase activities. The encoded protein catalyzes the reduction of glutamate to delta1-pyrroline-5-carboxylate, a critical step in the de novo biosynthesis of proline, ornithine and arginine. Mutations in this gene lead to hyperammonemia, hypoornithinemia, hypocitrullinemia, hypoargininemia and hypoprolinemia and may be associated with neurodegeneration, cataracts and connective tissue diseases. Alternatively spliced transcript variants, encoding different isoforms, have been described for this gene. [provided by RefSeq, Jul 2008]
    
    9. ALDH1B1
    This protein belongs to the aldehyde dehydrogenases family of proteins. Aldehyde dehydrogenase is the second enzyme of the major oxidative pathway of alcohol metabolism. This gene does not contain introns in the coding sequence. The variation of this locus may affect the development of alcohol-related problems. [provided by RefSeq, Jul 2008]
    
    10. ALDH1L1
    The protein encoded by this gene catalyzes the conversion of 10-formyltetrahydrofolate, nicotinamide adenine dinucleotide phosphate (NADP+), and water to tetrahydrofolate, NADPH, and carbon dioxide. The encoded protein belongs to the aldehyde dehydrogenase family. Loss of function or expression of this gene is associated with decreased apoptosis, increased cell motility, and cancer progression. There is an antisense transcript that overlaps on the opposite strand with this gene locus. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Jun 2012]
    
    11. ALDH1A2
    This protein belongs to the aldehyde dehydrogenase family of proteins. The product of this gene is an enzyme that catalyzes the synthesis of retinoic acid (RA) from retinaldehyde. Retinoic acid, the active derivative of vitamin A (retinol), is a hormonal signaling molecule that functions in developing and adult tissues. The studies of a similar mouse gene suggest that this enzyme and the cytochrome CYP26A1, concurrently establish local embryonic retinoic acid levels which facilitate posterior organ development and prevent spina bifida. Four transcript variants encoding distinct isoforms have been identified for this gene. [provided by RefSeq, May 2011]
    
    12. AGPS
    This gene is a member of the FAD-binding oxidoreductase/transferase type 4 family. It encodes a protein that catalyzes the second step of ether lipid biosynthesis in which acyl-dihydroxyacetonephosphate (DHAP) is converted to alkyl-DHAP by the addition of a long chain alcohol and the removal of a long-chain acid anion. The protein is localized to the inner aspect of the peroxisomal membrane and requires FAD as a cofactor. Mutations in this gene have been associated with rhizomelic chondrodysplasia punctata, type 3 and Zellweger syndrome. [provided by RefSeq, Jul 2008]
    
    13. ALDH9A1
    This protein belongs to the aldehyde dehydrogenase family of proteins. It has a high activity for oxidation of gamma-aminobutyraldehyde and other amino aldehydes. The enzyme catalyzes the dehydrogenation of gamma-aminobutyraldehyde to gamma-aminobutyric acid (GABA). This isozyme is a tetramer of identical 54-kD subunits. [provided by RefSeq, Jul 2008]
    
    14. ALDH4A1
    This protein belongs to the aldehyde dehydrogenase family of proteins. This enzyme is a mitochondrial matrix NAD-dependent dehydrogenase which catalyzes the second step of the proline degradation pathway, converting pyrroline-5-carboxylate to glutamate. Deficiency of this enzyme is associated with type II hyperprolinemia, an autosomal recessive disorder characterized by accumulation of delta-1-pyrroline-5-carboxylate (P5C) and proline. Alternatively spliced transcript variants encoding different isoforms have been identified for this gene. [provided by RefSeq, Jun 2009]
    
    15. ALDH6A1
    This gene encodes a member of the aldehyde dehydrogenase protein family. The encoded protein is a mitochondrial methylmalonate semialdehyde dehydrogenase that plays a role in the valine and pyrimidine catabolic pathways. This protein catalyzes the irreversible oxidative decarboxylation of malonate and methylmalonate semialdehydes to acetyl- and propionyl-CoA. Methylmalonate semialdehyde dehydrogenase deficiency is characterized by elevated beta-alanine, 3-hydroxypropionic acid, and both isomers of 3-amino and 3-hydroxyisobutyric acids in urine organic acids. Alternate splicing results in multiple transcript variants. [provided by RefSeq, Jun 2013]
    
    16. ALDH3B1
    This gene encodes a member of the aldehyde dehydrogenase protein family. Aldehyde dehydrogenases are a family of isozymes that may play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. The encoded protein is able to oxidize long-chain fatty aldehydes in vitro, and may play a role in protection from oxidative stress. Alternative splicing results in multiple transcript variants. [provided by RefSeq, Feb 2014]
    
    17. ALDH3B2
    This gene encodes a member of the aldehyde dehydrogenase family, a group of isozymes that may play a major role in the detoxification of aldehydes generated by alcohol metabolism and lipid peroxidation. The gene of this particular family member is over 10 kb in length. Altered methylation patterns at this locus have been observed in spermatozoa derived from patients exhibiting reduced fecundity. [provided by RefSeq, Aug 2017]
    
    18. ALDH16A1
    This gene encodes a member of the aldehyde dehydrogenase superfamily. The family members act on aldehyde substrates and use nicotinamide adenine dinucleotide phosphate (NADP) as a cofactor. This gene is conserved in chimpanzee, dog, cow, mouse, rat, and zebrafish. The protein encoded by this gene interacts with maspardin, a protein that when truncated is responsible for Mast syndrome. Alternatively spliced transcript variants encoding different isoforms have been found for this gene. [provided by RefSeq, Apr 2010]
    
    19. ALDH1L2
    This gene encodes a member of both the aldehyde dehydrogenase superfamily and the formyl transferase superfamily. This member is the mitochondrial form of 10-formyltetrahydrofolate dehydrogenase (FDH), which converts 10-formyltetrahydrofolate to tetrahydrofolate and CO2 in an NADP(+)-dependent reaction, and plays an essential role in the distribution of one-carbon groups between the cytosolic and mitochondrial compartments of the cell. Alternatively spliced transcript variants have been found for this gene.[provided by RefSeq, Oct 2010]
    
    20. ALDH8A1
    This gene encodes a member of the aldehyde dehydrogenase family of proteins. The encoded protein has been implicated in the synthesis of 9-cis-retinoic acid and in the breakdown of the amino acid tryptophan. This enzyme converts 9-cis-retinal into the retinoid X receptor ligand 9-cis-retinoic acid, and has approximately 40-fold higher activity with 9-cis-retinal than with all-trans-retinal. In addition, this enzyme has been shown to catalyze the conversion of 2-aminomuconic semialdehyde to 2-aminomuconate in the kynurenine pathway of tryptophan catabolism. [provided by RefSeq, Jul 2018]
    
    21. ALDH7A1P1
    None
    
    22. ALDH1L1-AS2
    None
    
    23. ALDH7A1P4
    None
    
    24. ALDH7A1P3
    None
    
    25. ALDH7A1P2
    None
    
    26. ALDH1A3-AS1
    None
    
    27. ALDH1A2-AS1
    None
    
    28. ALDH1L1-AS1
    None
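
The last eight entries print None because no Summary text was returned for them. If only genes with a summary are of interest, those entries can be filtered out afterwards. A minimal sketch, where drop_missing is a hypothetical helper and not part of the function above:

```python
def drop_missing(summaries):
    """Return a copy of {gene name: summary} without empty or None summaries."""
    return {name: text for name, text in summaries.items() if text}


# Small stand-in for gene_summaries["human"]; the summary text is abbreviated.
example = {"ALDH2": "This protein belongs to ...", "ALDH7A1P1": None}
print(drop_missing(example))  # {'ALDH2': 'This protein belongs to ...'}
```

Applied to the real result, this would be drop_missing(gene_summaries["human"]).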
    

    Example 3 – Receiving thousands of genes across all organisms (unfiltered)

    Setting organism=None and max_gene_ids=10000 for the same query (gene_name='ALDH*') returns 9010 Gene IDs, i.e., 9,010 genes whose names match ALDH* across all organisms in the Entrez Gene database at the time of writing.

    For example:

    >>> gene_summaries = get_entrez_gene_summary("ALDH*", email, organism=None, max_gene_ids=10000)
    9010 gene IDs returned associated with gene ALDH*.
        Retrieving summary for 217...
        Retrieving summary for 216...
        Retrieving summary for 19378...
        Retrieving summary for 11669...
    [...]
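
With one request per Gene ID and a 0.34 s pause after each, 9010 IDs need at least 9010 × 0.34 s ≈ 3063 s, roughly 51 minutes, before counting network time. One way to cut this down is to send IDs in batches, since the E-utilities accept a comma-separated list of IDs. The sketch below only shows the batching step; the batch size of 200 is an arbitrary choice for illustration and chunked is a hypothetical helper.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


gene_ids = [str(i) for i in range(9010)]  # placeholder IDs, not real Gene IDs
batches = list(chunked(gene_ids, 200))
print(len(batches))      # 46 requests instead of 9010
print(len(batches[-1]))  # 10 IDs in the final batch
```

Each batch could then be passed as ",".join(batch) to a single esummary or efetch call. Note that a batched response contains a list of document summaries rather than a single one, so the parsing code in the function above would need adjusting.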