python beautifulsoup pubchem

How to extract 'Odor' information from PubChem using BeautifulSoup


I wrote the following Python code to extract 'Odor' information from PubChem for a particular molecule, in this case nonanal (CID 31289). The webpage for this molecule is: https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor

import requests
from bs4 import BeautifulSoup

url = 'https://pubchem.ncbi.nlm.nih.gov/compound/31289#section=Odor'
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
odor_section = soup.find('section', {'id': 'Odor'})
odor_info = odor_section.find('div', {'class': 'section-content'})

print(odor_info.text.strip())

I get the following error: AttributeError: 'NoneType' object has no attribute 'find'. It seems that BeautifulSoup does not receive the whole page.

I expect the following output: Orange-rose odor, Floral, waxy, green


Solution

  • The page in question makes an AJAX request to load its data. You can see this request in the Network tab of your browser's dev tools (F12 in many browsers).


    That is to say, the data simply isn't there when the initial page loads - so it isn't found by BeautifulSoup.

    To solve the problem, request the JSON data directly from PubChem's PUG View REST API:

    import requests

    # PubChem compound ID (CID) for nonanal
    PubChem_Nonanal_CID = 31289
    compound_data_url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/{}/JSON/'
    compound_info = requests.get(compound_data_url.format(PubChem_Nonanal_CID))

    print(compound_info.json())
    

    Parsing the JSON Reply

    Parsing the reply is a bit of a challenge, because it consists of many nested lists. If the order of the sections isn't guaranteed, you can search by heading instead of by position:

    for section in compound_info.json()['Record']['Section']:
        if section['TOCHeading'] == 'Chemical and Physical Properties':
            for sub_section in section['Section']:
                if sub_section['TOCHeading'] == 'Experimental Properties':
                    for sub_sub_section in sub_section['Section']:
                        if sub_sub_section['TOCHeading'] == 'Odor':
                            # Print the first string of the first 'Odor' entry
                            print(sub_sub_section['Information'][0]['Value']['StringWithMarkup'][0]['String'])
                            break  # leaves only the innermost loop
    
    

    Otherwise, paste the reply into a JSON viewer such as jsonformatter.com and follow the path it shows to the value:


    # object►Record►Section►3►Section►1►Section►2►Information►0►Value►StringWithMarkup►0►String

    odor = compound_info.json()['Record']['Section'][3]['Section'][1]['Section'][2]['Information'][0]['Value']['StringWithMarkup'][0]['String']
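
As an alternative to both the nested loops and the hard-coded indices, the lookup can be sketched as a small recursive search by heading. This is only a sketch: `find_section` and `section_strings` are hypothetical helper names, and `sample_record` is a hand-made, heavily trimmed stand-in for the real reply that assumes the same key names used above (`Record`, `Section`, `TOCHeading`, `Information`, `Value`, `StringWithMarkup`, `String`). Splitting the odor text across two `Information` entries is also an assumption made for illustration.

```python
def find_section(sections, heading):
    """Depth-first search for the first section whose TOCHeading matches."""
    for section in sections:
        if section.get("TOCHeading") == heading:
            return section
        # Recurse into nested sections, if there are any
        found = find_section(section.get("Section", []), heading)
        if found is not None:
            return found
    return None


def section_strings(section):
    """Collect every 'String' found under a section's 'Information' list."""
    strings = []
    for info in section.get("Information", []):
        for markup in info.get("Value", {}).get("StringWithMarkup", []):
            strings.append(markup["String"])
    return strings


# Hypothetical, trimmed stand-in for the parsed JSON reply (illustrative data only)
sample_record = {
    "Record": {
        "Section": [
            {"TOCHeading": "Names and Identifiers", "Section": []},
            {
                "TOCHeading": "Chemical and Physical Properties",
                "Section": [
                    {
                        "TOCHeading": "Experimental Properties",
                        "Section": [
                            {
                                "TOCHeading": "Odor",
                                "Information": [
                                    {"Value": {"StringWithMarkup": [{"String": "Orange-rose odor"}]}},
                                    {"Value": {"StringWithMarkup": [{"String": "Floral, waxy, green"}]}},
                                ],
                            }
                        ],
                    }
                ],
            },
        ]
    }
}

odor_section = find_section(sample_record["Record"]["Section"], "Odor")
odors = section_strings(odor_section) if odor_section else []
print(", ".join(odors))
```

With a real reply, the same call would take the parsed JSON's `['Record']['Section']` list in place of the sample data. Because the search matches on headings rather than positions, it should keep working if sections are reordered, and it returns `None` instead of raising when a compound has no 'Odor' section.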