python xpath

Is there a simpler way to get all nested text inside of an ElementTree?


I am currently using the xml.etree Python library to parse HTML.

After finding a target DOM element, I am attempting to extract its text. Unfortunately, it seems that the .text attribute is severely limited in its functionality and will only return the immediate inner text of an element (and not anything nested). Do I really have to loop through all the children of the ElementTree? Or is there a more elegant solution?


Solution

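  • You can use itertext(), which yields the text of an element and all of its descendants in document order. If you don’t want the extra whitespace, indentation, and line breaks, collapse them with split() and join() (strip() alone only trims the ends of the string).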

    import xml.etree.ElementTree as ET
    
    html = """<html>
        <head>
            <title>Example page</title>
        </head>
        <body>
            <p>Moved to <a href="http://example.org/">example.org</a>
            or <a href="http://example.com/">example.com</a>.</p>
        </body>
    </html>"""
    
    root = ET.fromstring(html)
    
    target_element = root.find(".//body")
    
    # get all nested text, with the original whitespace preserved
    all_text = ''.join(target_element.itertext())
    
    # collapse indentation and line breaks into single spaces
    all_text_clear = ' '.join(all_text.split())
    
    print(all_text)
    print(all_text_clear)
    

    Output:

            Moved to example.org
            or example.com.
        
    Moved to example.org or example.com.
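
If you need this in several places, the same idea can be wrapped in a small function. The sketch below is only an illustration: `inner_text` and its `normalize` parameter are hypothetical names, not part of xml.etree. It also shows why .text alone falls short: text that follows a child element is stored on that child's .tail, not on the parent.

```python
import xml.etree.ElementTree as ET


def inner_text(element, normalize=True):
    """Return all text nested inside `element` (hypothetical helper)."""
    text = "".join(element.itertext())
    if normalize:
        # Collapse runs of whitespace (indentation, line breaks) into single spaces.
        text = " ".join(text.split())
    return text


snippet = '<p>Moved to <a href="http://example.org/">example.org</a> or <b>here</b>.</p>'
p = ET.fromstring(snippet)

print(repr(p.text))     # 'Moved to ' (only the text before the first child)
print(repr(p[0].tail))  # ' or ' (text after <a> is stored on <a>.tail)
print(inner_text(p))    # Moved to example.org or here.
```

Note that xml.etree only parses well-formed XML, so this assumes the HTML you feed it is valid XML as well.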