In the following example I am expecting to get Foo
for the <h2>
text:
from io import StringIO
from html5lib import HTMLParser
fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
''')
etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
h2 = etree.findall('.//h2')[0]
h2.text
Unfortunately I get ''
. Why?
Strangly, foo is in the text:
>>> list(h2.itertext())
['1. ', 'Foo', '¶']
>>> h2.getchildren()
[<Element 'span' at 0x7fa54c6a1bd8>, <Element 'a' at 0x7fa54c6a1c78>]
>>> [node.text for node in h2.getchildren()]
['1. ', '¶']
So where is Foo
?
I think you are one level too shallow in the tree. Try this:
from io import StringIO
from html5lib import HTMLParser
fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<body>
<h2>
<span class="section-number">1. </span>
Foo
<a class="headerlink" href="#foo">¶</a>
</h2>
</body>
</html>
''')
etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
etree.findall('.//h2')[0][0].tail
More generally, to crawl all text and tail, try a loop like this:
for u in etree.findall('.//h2')[0]:
print(u.text, u.tail)