python lxml wbr

Removing <wbr> tags and grabbing the info between


I'm scraping data from a webpage and have already handled a section where the value I want is followed by a <br> tag.

<div class="scrollWrapper">
    <h3>Smiles</h3>
    CC=O<br>
    <button type="button" id="downloadSmiles">Download</button>
</div>

I solved this with the script below, which outputs CC=O.

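import requests
from lxml import html

# substance holds the chemical name to look up (defined elsewhere)
page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
tree = html.fromstring(page.text)
if "Smiles" in page.text:
    # take the text node immediately before the first <br> under the Smiles heading's parent
    smiles = tree.xpath('normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])')
else:
    smiles = ""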

However, while browsing the pages of other chemicals, I found some that have <wbr> tags inside the value. I have no idea how to get rid of them while still grabbing the text between them. An example is shown below; my desired output is c1(c2ccccc2)ccc(N)cc1.

<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>

Solution

  • The easiest thing to do is to replace the string <wbr> in page.text with an empty string before you parse it as HTML. Since it is a tag enclosed in < and >, I doubt any of the useful information you are looking for would contain it.

    Example:

    import requests
    from lxml import html
    
    page = requests.get('http://chem.sis.nlm.nih.gov/chemidplus/name/' + substance)
    # drop every <wbr> before parsing so the value stays in one text node
    tree = html.fromstring(page.text.replace('<wbr>', ''))
    if "Smiles" in page.text:
        smiles = tree.xpath('normalize-space(//*[text()="Smiles"]/..//br[1]/preceding-sibling::text()[1])')
    else:
        smiles = ""
    

    Otherwise, you can use BeautifulSoup as in @Bun's answer, or write a more complex XPath.

    Also, a simpler XPath for your case would be:

    'normalize-space(//*[text()="Smiles"]/following-sibling::text()[1])'
    

    Your original expression finds the Smiles element, goes up to its parent, finds the first br element among the parent's descendants, and then takes the text node just before that br. Instead, you can take the first text node that follows the Smiles element directly. Once the <wbr> tags have been removed, that single text node holds the whole value.
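
To check the approach without hitting the site, here is a minimal offline sketch that combines the <wbr> removal with the simpler XPath and runs it against the two HTML snippets from the question. The helper name extract_smiles and the sample constants are made up for illustration. The regular expression is a small generalization of the plain replace('<wbr>', '') above, on the assumption that a page might also write the tag as <WBR> or <wbr/>.

```python
import re

from lxml import html

# The two snippets from the question, used here instead of a live request.
SAMPLE_PLAIN = """
<div class="scrollWrapper">
    <h3>Smiles</h3>
    CC=O<br>
    <button type="button" id="downloadSmiles">Download</button>
</div>
"""

SAMPLE_WBR = """
<div class="scrollWrapper">
   <h3>Smiles</h3>
   c1(c2ccccc2)<wbr>ccc(N)<wbr>cc1<br>
   <button type="button" id="downloadSmiles">Download</button>
</div>
"""

# Assumption: the tag may appear as <wbr>, <WBR> or <wbr/>.
WBR_TAG = re.compile(r'<wbr\s*/?>', re.IGNORECASE)

# First text node after the element whose text is exactly "Smiles".
SMILES_XPATH = 'normalize-space(//*[text()="Smiles"]/following-sibling::text()[1])'


def extract_smiles(page_text):
    """Hypothetical helper: strip <wbr> tags, parse, and return the SMILES text.

    Returns an empty string when there is no "Smiles" heading, because
    normalize-space() of an empty node-set is the empty string.
    """
    tree = html.fromstring(WBR_TAG.sub('', page_text))
    return tree.xpath(SMILES_XPATH)


if __name__ == "__main__":
    print(extract_smiles(SAMPLE_PLAIN))
    print(extract_smiles(SAMPLE_WBR))
```

Because the XPath already yields an empty string when nothing matches, the separate "Smiles" in page.text check is not needed in this version.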