Tags: python-3.x, xpath, lxml.html

Using XPath, select node without text sibling


I want to extract some HTML elements with python3 and the HTML parser provided by lxml.

Consider this HTML:

<!DOCTYPE html>
<html>
  <body>
    <span class="foo">
      <span class="bar">bar</span>
      foo
    </span>
  </body>
</html>

Consider this program:

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import html
tree = html.fromstring('html from above')
bars = tree.xpath("//span[@class='bar']")
print(bars)
print(html.tostring(bars[0], encoding="unicode"))

In a browser, the query selector "span.bar" selects only the span element. This is what I desire. However, the above program produces:

[<Element span at 0x7f5dd89a4048>]
<span class="bar">bar</span>foo

It looks like my XPath does not actually behave like a query selector and the sibling text node is picked up next to the span element. How can I adjust the XPath to select only the bar element, but not the text "foo"?


Solution

  • Note that the XML tree model in lxml (as well as in the standard module xml.etree) has the concept of a tail: a text node that directly follows an element (its following-sibling text node) is stored as the tail of that element. So your XPath correctly returns only the span element, but according to the tree model that element has a tail which holds the text 'foo', and tostring() serializes the tail along with the element.

    As a workaround, assuming that you don't need the tail text in the tree afterwards (this modifies the tree), simply clear the tail before printing:

    >>> bars[0].tail = ''
    >>> print(html.tostring(bars[0], encoding="unicode"))
    <span class="bar">bar</span>