Tags: html, python-3.x, web-scraping, xpath, xml.etree

Python - Getting the text of a link with Etree using Xpath


I'm trying to get the text "Former United States Secretary Of State" out of this tag. I've tried many ways but cannot seem to get it.

<div class="tag"><a href="en/profession/748/former-united-states-secretary-of-state" class="">Former United States Secretary Of State</a></div>

This is my code:

site_content = etree.HTML(result)
selection = site_content.xpath(xpath_select)
content = [item.strip() for item in selection]

Every other XPath is working. This is the XPath I'm using, since this tag appears multiple times on the page: "/html/body/div[5]/div[4]/div[5]/div[*]"

Any help in the right direction would be greatly appreciated.

Working url = https://www.blackandwhitequotes.com/en/quotes/william-jennings-bryan_1182154_1&key=2OP8jfJC1D


Solution

  • Your XPath doesn't seem to be valid for your HTML example.

    In general, when building XPaths it's best to rely on classes and identifiers rather than on tree structure. So we should write //div[contains(@class,"tag")] instead of a positional path such as //div/div/div[1] (note that XPath positions start at 1, not 0).

    In your case you can also append //text() (the text() node test) to the XPath to select all of the text nodes inside your node:

    from lxml import etree
    
    html = """<div class="tag"><a href="en/profession/748/former-united-states-secretary-of-state" class="">Former United States Secretary Of State</a></div>"""
    tree = etree.HTML(html)
    # //text() returns every text node under the matching div; take the first one
    print(tree.xpath("//div[contains(@class,'tag')]//text()")[0])
    # prints: Former United States Secretary Of State
    

    Looking for a div with the class tag is a much more reliable way of parsing this HTML than the positional path /html/body/div[5]/div[4]/div[5]/div[*].
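
Since the question mentions several of these tags on one page, here is a minimal sketch of how the same idea might be applied to all of them at once. The second div and its href are made up for illustration, and the sketch assumes the class attribute is exactly "tag". It also shows a likely reason the original list comprehension fails: an XPath that ends in an element step returns lxml Element objects, which have no .strip() method, while an XPath that ends in text() returns strings.

```python
from lxml import etree

# Hypothetical fragment modelled on the question's markup.
# The second div (and its href) is invented for illustration only.
html = """
<div class="tag"><a href="en/profession/748/former-united-states-secretary-of-state" class="">Former United States Secretary Of State</a></div>
<div class="tag"><a href="en/profession/1/lawyer" class=""> Lawyer </a></div>
"""

tree = etree.HTML(html)

# Ending the path at an element step yields Element objects, not strings,
# so calling .strip() on these items would raise AttributeError.
elements = tree.xpath("//div[@class='tag']/a")

# Ending the path at text() yields string results, so .strip() works.
texts = tree.xpath("//div[@class='tag']/a/text()")
content = [item.strip() for item in texts]
print(content)
```

If the elements themselves are needed (for example to read the href as well), the text is also available through each element's .text attribute.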