
Why is text of HTML node empty with HTMLParser?


In the following example I am expecting to get Foo for the <h2> text:

from io import StringIO
from html5lib import HTMLParser

fp = StringIO('''
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <body>
        <h2>
            <span class="section-number">1. </span>
            Foo
            <a class="headerlink" href="#foo">¶</a>
        </h2>
    </body>
</html>
''')

etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
h2 = etree.findall('.//h2')[0]

h2.text

Unfortunately I get ''. Why?

Strangely, Foo does show up in the text:

>>> list(h2.itertext())
['1. ', 'Foo', '¶']

>>> list(h2)  # getchildren() was removed in Python 3.9
[<Element 'span' at 0x7fa54c6a1bd8>, <Element 'a' at 0x7fa54c6a1c78>]

>>> [node.text for node in h2]
['1. ', '¶']

So where is Foo?


Solution

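  • You are looking one level too shallow in the tree. In ElementTree, text that follows a child element is stored in that child's tail attribute, not in the parent's text, so Foo ends up in the tail of the <span>. Try this: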

    from io import StringIO
    from html5lib import HTMLParser
    
    fp = StringIO('''
    <!DOCTYPE html>
    <html xmlns="http://www.w3.org/1999/xhtml">
        <body>
            <h2>
                <span class="section-number">1. </span>
                Foo
                <a class="headerlink" href="#foo">¶</a>
            </h2>
        </body>
    </html>
    ''')
    
    etree = HTMLParser(namespaceHTMLElements=False).parse(fp)
    h2 = etree.findall('.//h2')[0]
    # Foo follows the <span>, so it is stored in the span's tail
    print(h2[0].tail.strip())
    

    More generally, to walk the text and tail of every child, use a loop like this:

    for u in etree.findall('.//h2')[0]:
        print(u.text, u.tail)
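
The text/tail split can also be wrapped in a small helper. This is a minimal sketch, not part of html5lib: own_text is a hypothetical name, and the example parses a well-formed fragment with the standard library's xml.etree.ElementTree, on the assumption that html5lib's default tree builder yields the same kind of Element objects. It collects the element's own text plus the tail of each child, and skips the text inside the children.

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Return text that sits directly inside `element`, ignoring text inside its children.

    Hypothetical helper for illustration only.
    """
    # Text before the first child lives in element.text (may be None).
    parts = [element.text or '']
    # Text after each child lives in that child's tail (may be None).
    for child in element:
        parts.append(child.tail or '')
    # Collapse the indentation and newlines left over from the markup.
    return ' '.join(''.join(parts).split())


# Well-formed stand-in for the <h2> from the question.
h2 = ET.fromstring('''
<h2>
    <span class="section-number">1. </span>
    Foo
    <a class="headerlink" href="#foo">¶</a>
</h2>
'''.strip())

print(own_text(h2))
```

With an html5lib tree, the same helper could presumably be called on etree.findall('.//h2')[0] instead of the hand-built fragment.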