python html lxml

LXML: get the text in between an element's children


I have a badly structured HTML template in which my <section> elements contain multiple child elements (p, figure, a, etc.) but also raw text in between them. How can I access all those snippets of text and edit them in place? (What I need is to replace every $$code$$ with tags.) Both section.text and section.tail return empty strings...


Solution

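  • Examine the .tail of the child element that immediately precedes the text. So, in <section>A<p>B</p>C<p>D</p>E</section>, the .tails of the two <p> elements contain C and E. (A, the text before the first child, is in section.text.) A child with no text after it has a .tail of None.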

    Example:

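    from lxml import etree
    
    root = etree.fromstring('<root><section>A<p>B</p>C<p>D</p>E</section></root>')
    
    # Iterating over an element yields its direct children.
    for section_child in root.find('section'):
        # .tail is None when no text follows the child, so guard before editing.
        if section_child.tail:
            section_child.tail = section_child.tail.lower()
    
    # etree.tounicode() is deprecated; tostring(..., encoding='unicode') is the replacement.
    print(etree.tostring(root, encoding='unicode'))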
    

    Result:

    <root><section>A<p>B</p>c<p>D</p>e</section></root>
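
For the actual goal in the question (turning $$...$$ markers into tags), the same .text/.tail idea can be extended as in the sketch below. This is only one possible approach and rests on some assumptions: the helper name wrap_markers is made up for illustration, the replacement tag is assumed to be <code> because the question does not say which tag is wanted, and only text sitting directly inside the given element is handled (not text nested in its children). The import falls back to the standard library's ElementTree, which uses the same .text/.tail model, in case lxml is not installed.

```python
import re

try:
    from lxml import etree
except ImportError:  # fallback: the stdlib module exposes the same .text/.tail API
    import xml.etree.ElementTree as etree

# Non-greedy match of anything wrapped in a pair of $$ markers.
MARKER = re.compile(r'\$\$(.+?)\$\$')


def wrap_markers(parent, tag='code'):
    """Replace $$...$$ in parent's direct text with <tag> elements (hypothetical helper).

    "Direct text" means parent.text plus the .tail of each direct child.
    """
    # Snapshot the children first, because new elements are inserted while looping.
    # None stands for the slot before the first child (parent.text).
    for child in [None] + list(parent):
        text = parent.text if child is None else child.tail
        if not text or '$$' not in text:
            continue
        # With one capture group, split() returns
        # [plain, marked, plain, marked, ..., plain].
        parts = MARKER.split(text)
        if child is None:
            parent.text = parts[0]
            index = 0
        else:
            child.tail = parts[0]
            index = list(parent).index(child) + 1
        # Each marked piece becomes a new element; the plain text after it
        # becomes that element's tail.
        for marked, plain in zip(parts[1::2], parts[2::2]):
            new = etree.Element(tag)
            new.text = marked
            new.tail = plain
            parent.insert(index, new)
            index += 1


root = etree.fromstring(
    '<root><section>Run $$ls$$ now<p>B</p>then $$cd$$ and $$pwd$$</section></root>'
)
wrap_markers(root.find('section'))
print(etree.tostring(root, encoding='unicode'))
# <root><section>Run <code>ls</code> now<p>B</p>then <code>cd</code> and <code>pwd</code></section></root>
```

To process every section in a document, the helper could be called in a loop such as `for section in root.iter('section'): wrap_markers(section)`.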