
Parse an HTML body fragment in lxml


I'm trying to parse a fragment of html:

<body><h1>title</h1><img src=""></body>

I'm using lxml.html.fromstring, and it is driving me insane because it keeps stripping the <body> tag from my fragments:

 > lxml.html.fromstring('<html><h1>a</h1></html>').tag
 'html'
 > lxml.html.fromstring('<div><h1>a</h1></div>').tag
 'div'
 > lxml.html.fromstring('<body><h1>a</h1></body>').tag
 'h1'

I've also tried document_fromstring, fragment_fromstring, clean_html with page_structure=False, etc. Nothing works.

I need to use lxml, since I'm passing the HTML fragment to PyQuery.

I just want lxml to not mess with my html fragment. Is it possible to do that?


Solution

  • .fragment_fromstring() removes the <html> tag as well. Whenever the input is not a full HTML document (one that starts with a doctype or a top-level <html> element), .fromstring() falls back to fragment parsing, much like .fragment_fromstring(), and that always removes both the <html> and the <body> tags.

    One work-around is to tell .fragment_fromstring() to give you a <body> parent tag:

    >>> lxml.html.fragment_fromstring('<body><h1>a</h1></body>', create_parent='body')
    <Element body at 0x10d06fbf0>
    

    This does not preserve any attributes on the original <body> tag.
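
A quick way to see the attribute loss, as a sketch: create_parent builds a brand-new <body> element around the parsed children, so nothing from the original tag is carried over. The class="foo" input is only an illustration.

```python
import lxml.html

# create_parent='body' wraps the parsed children in a freshly created <body>,
# so the class attribute from the input is not copied over.
body = lxml.html.fragment_fromstring(
    '<body class="foo"><h1>a</h1></body>', create_parent='body')

print(body.tag)                        # body
print(dict(body.attrib))               # {}
print([child.tag for child in body])   # ['h1']
```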

    Another work-around is to use the .document_fromstring() method, which wraps your fragment in an <html> element; you can then take its first child to get the <body> back:

    >>> lxml.html.document_fromstring('<body><h1>a</h1></body>')[0]
    <Element body at 0x10d06fcb0>
    

    This does preserve attributes on the <body>:

    >>> lxml.html.document_fromstring('<body class="foo"><h1>a</h1></body>')[0].attrib
    {'class': 'foo'}
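
Indexing with [0] assumes that <body> is the first child of <html>. If the input also contains head-level content such as a <title>, the parser can put a <head> element first. A slightly more defensive sketch looks the element up by name instead; the sample inputs here are only illustrations.

```python
import lxml.html

doc = lxml.html.document_fromstring('<body class="foo"><h1>a</h1></body>')
body = doc.find('body')          # returns None if there is no <body> child
print(body.tag)                  # body
print(dict(body.attrib))         # {'class': 'foo'}

# Input with head-level content (here a <title>), where the first child
# of <html> is no longer guaranteed to be <body>.
doc2 = lxml.html.document_fromstring('<title>t</title><body><h1>a</h1></body>')
print([child.tag for child in doc2])
print(doc2.find('body').tag)     # body
```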
    

    Using the .document_fromstring() function on your first example gives:

    >>> body = lxml.html.document_fromstring('<body><h1>title</h1><img src=""></body>')[0]
    >>> lxml.html.tostring(body)
    b'<body><h1>title</h1><img src=""></body>'
    

    If you only want to do this when there is no <html> tag, do what lxml.html.fromstring() does and test for a full document first:

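    import lxml.html

    # Private lxml helpers; on Python 3, bytes input needs the bytes matcher and str input the unicode one
    htmltest = lxml.html._looks_like_full_html_bytes if isinstance(inputtext, bytes) else lxml.html._looks_like_full_html_unicode
    if htmltest(inputtext):
        # Full document: fromstring() keeps the <html> root
        tree = lxml.html.fromstring(inputtext)
    else:
        # Fragment: parse as a document, then take the <body> child
        tree = lxml.html.document_fromstring(inputtext)[0]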
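
The check above relies on underscore-prefixed names, which are private to lxml and could change between releases. Here is a minimal sketch that avoids them, assuming a full document is one that starts with <html or <!doctype after optional whitespace. The two patterns and the parse_keeping_body helper are illustrative names, not part of lxml.

```python
import re
import lxml.html

# Assumption: a "full document" starts with <html or <!doctype,
# ignoring leading whitespace and case.
_FULL_DOC_STR = re.compile(r'^\s*<(?:html|!doctype)', re.I)
_FULL_DOC_BYTES = re.compile(rb'^\s*<(?:html|!doctype)', re.I)


def parse_keeping_body(markup):
    """Hypothetical helper: return <html> for full documents, <body> otherwise."""
    pattern = _FULL_DOC_BYTES if isinstance(markup, bytes) else _FULL_DOC_STR
    doc = lxml.html.document_fromstring(markup)
    if pattern.match(markup):
        return doc
    body = doc.find('body')
    # Fall back to the <html> root if the parser produced no <body>.
    return body if body is not None else doc


print(parse_keeping_body('<body><h1>a</h1></body>').tag)    # body
print(parse_keeping_body('<html><h1>a</h1></html>').tag)    # html
print(parse_keeping_body(b'<body><h1>a</h1></body>').tag)   # body
```

Note that, unlike lxml.html.fromstring(), this sketch also returns a <body> wrapper for input such as '<div><h1>a</h1></div>', which may or may not be what you want. The returned element should be usable as input to PyQuery, which accepts lxml elements.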