excelvbaweb-scrapinggetelementsbyclassname

Retrieve element from page


I want to retrieve the IUPAC name of a chemical on https://echa.europa.eu/brief-profile/-/briefprofile/100.000.685

I'd like the printed result to return Benzene in this example.

The code below pulls all elements with className.

Public Sub GetContents()
    
    Dim XMLReq As New MSXML2.XMLHTTP60
    Dim HTMLDoc As New MSHTML.HTMLDocument
    
    XMLReq.Open "Get", "https://echa.europa.eu/brief-profile/-/briefprofile/100.000.685", False
    XMLReq.send
     
    HTMLDoc.body.innerHTML = XMLReq.responseText

    Set IUPACName = HTMLDoc.getElementsByClassName("col-sm-8")(0)
    
    Debug.Print IUPACName.innerText

End Sub

This returns:

EC / List name:
IUPAC name: benzene
Substance names and other identifiers

Inspecting the page there doesn't seem to be an identifier to just return Benzene.

Text I wish to pull.
enter image description here


Solution

  • I can't test on other Office versions but 2019, at least, you can use an attribute selector as follows:

    Set IUPACName = HTMLDoc.querySelector("[title*=IUPAC]")
        
    Debug.Print IUPACName.innerText
    

    I was expecting to use:

    Debug.Print IUPACName.NextSibling.NodeValue
    

    So, that latter one maybe what you need on your Office version.

    The world of mshtml.dll is quite topsy-turvy as moment.