python beautifulsoup ncbi

Extracting specific data in Python from multiple URL links using bs4


I need help extracting data from a main URL that redirects to a sub-URL, which contains the data I want to extract.

Main URL = "https://www.ncbi.nlm.nih.gov/gene/?term={gene_id}"
Sub URL = "https://www.ncbi.nlm.nih.gov/gene/{unique_gene_id_from_remote_side}"

The user defines a variable with the required gene_id (e.g. APOE, SLC7A11).

For example, main_url = https://www.ncbi.nlm.nih.gov/gene/?term=APOE redirects to a sub-link that carries the gene's ID, sub_url = https://www.ncbi.nlm.nih.gov/gene/348. From that page I only need to extract the summary section.


I can get as far as the second URL, but I cannot extract the href from it or get the summary.

The code I tried:

import requests
from bs4 import BeautifulSoup

gen_ids = ['APOE','SLC7A11']

for gen in gen_ids:
    url = f"https://www.ncbi.nlm.nih.gov/gene/?term={gen}"
    print(url)
    r = requests.get(url)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'lxml')
    x = soup.find('div', class_='panel')

    h = soup.find('h4', class_='ncbi-doc-title')
    h1 = [a['href'] for a in h.find_all('a')]

    print(h)
    print(h1)

Solution

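  • You can try it like this.

    The search results page keeps each gene link in a td with class 'gene-name-id'. The code below follows every one of those links and prints the contents of the div with class 'rprt-section gene-summary' on each sub-page.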

    import requests
    from bs4 import BeautifulSoup

    base_url = 'https://www.ncbi.nlm.nih.gov'
    gen_ids = ['APOE', 'SLC7A11']

    for gen in gen_ids:
        url = f"{base_url}/gene/?term={gen}"
        print(url)
        soup = BeautifulSoup(requests.get(url).text, 'lxml')

        # Each search result keeps its gene link in a <td class="gene-name-id">.
        cells = soup.find_all('td', class_='gene-name-id')
        links = [base_url + cell.find('a')['href'] for cell in cells]

        for link in links:
            sub_soup = BeautifulSoup(requests.get(link).text, 'lxml')
            # The summary block on the gene page.
            summary = sub_soup.find('div', class_='rprt-section gene-summary')
            if summary is not None:  # skip pages without a summary block
                print(list(summary.stripped_strings))
    
    Output:

    https://www.ncbi.nlm.nih.gov/gene/?term=APOE
    ['Summary', 'Go to the top of the page', 'Help', 'Official\n                         Symbol', 'APOE', 'provided by', 'HGNC', 'Official\n                         Full Name', 'apolipoprotein E', 'provided by', 'HGNC', 'Primary source', 'HGNC:HGNC:613', 'See related', 'Ensembl:ENSG00000130203', 'MIM:107741', 'Gene type', 'protein coding', 'RefSeq status', 'REVIEWED', 'Organism', 'Homo sapiens', 'Lineage', 'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo', 'Also known as', 'AD2; LPG; APO-E; ApoE4; LDLCQ5', 'Summary', 'The protein encoded by this gene is a major apoprotein of the chylomicron. It binds to a specific liver and peripheral cell receptor, and is essential for the normal catabolism of triglyceride-rich lipoprotein constituents. This gene maps to chromosome 19 in a cluster with the related apolipoprotein C1 and C2 genes. Mutations in this gene result in familial dysbetalipoproteinemia, or type III hyperlipoproteinemia (HLP III), in which increased plasma cholesterol and triglycerides are the consequence of impaired clearance of chylomicron and VLDL remnants. [provided by RefSeq, Jun 2016]', 'Expression', 'Biased expression in liver (RPKM 1021.7), kidney (RPKM 648.1) and 7 other tissues', 'See more', 'Orthologs', 'mouse', 'all', 'NEW', 'Try the new', 'Gene table', 'Try the new', 'Transcript table']
    
    
    https://www.ncbi.nlm.nih.gov/gene/?term=SLC7A11
    ['Summary', 'Go to the top of the page', 'Help', 'Official\n                         Symbol', 'SLC7A11', 'provided by', 'HGNC', 'Official\n                         Full Name', 'solute carrier family 7 member 11', 'provided by', 'HGNC', 'Primary source', 'HGNC:HGNC:11059', 'See related', 'Ensembl:ENSG00000151012', 'MIM:607933', 'Gene type', 'protein coding', 'RefSeq status', 'VALIDATED', 'Organism', 'Homo sapiens', 'Lineage', 'Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo', 'Also known as', 'xCT; CCBR1', 'Summary', 'This gene encodes a member of a heteromeric, sodium-independent, anionic amino acid transport system that is highly specific for cysteine and glutamate. In this system, designated Xc(-), the anionic form of cysteine is transported in exchange for glutamate. This protein has been identified as the predominant mediator of Kaposi sarcoma-associated herpesvirus fusion and entry permissiveness into cells. Also, increased expression of this gene in primary gliomas (compared to normal brain tissue) was associated with increased glutamate secretion via the XCT channels, resulting in neuronal cell death. [provided by RefSeq, Sep 2011]', 'Expression', 'Biased expression in brain (RPKM 12.7), thyroid (RPKM 4.9) and 8 other tissues', 'See more', 'Orthologs', 'mouse', 'all', 'NEW', 'Try the new', 'Gene table', 'Try the new', 'Transcript table']
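
The lists printed above hold every string in the summary block, while the question asks for the summary text only. Here is a minimal sketch for picking out just that text. It assumes the layout seen in the output above: 'Summary' appears once as the section heading and once as a field label, and the description is the string right after the second one. The names extract_summary and sample are hypothetical helpers for illustration, not part of bs4 or the NCBI site, and the approach may break if the page layout changes.

```python
def extract_summary(strings):
    """Return the description that follows the 'Summary' field label, or None.

    Assumes 'Summary' occurs at least twice: first as the section heading,
    then as the field label directly before the description text.
    """
    positions = [i for i, s in enumerate(strings) if s == "Summary"]
    # Fewer than two hits means only the heading is present (no description),
    # and a label at the very end has nothing after it.
    if len(positions) < 2 or positions[-1] + 1 >= len(strings):
        return None
    return strings[positions[-1] + 1]


# Shortened stand-in for list(summary.stripped_strings) from the APOE page.
sample = [
    "Summary", "Go to the top of the page", "Help",
    "Gene type", "protein coding",
    "Also known as", "AD2; LPG; APO-E; ApoE4; LDLCQ5",
    "Summary",
    "The protein encoded by this gene is a major apoprotein of the chylomicron.",
    "Expression",
    "Biased expression in liver (RPKM 1021.7), kidney (RPKM 648.1) and 7 other tissues",
]

print(extract_summary(sample))
```

In the loop from the answer, this would replace the final print with something like print(extract_summary(list(summary.stripped_strings))).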