python beautifulsoup html5lib

Stop BeautifulSoup from adding html, head and body tags automatically


I'm using BeautifulSoup with html5lib, and it adds the html, head and body tags automatically:

BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>

Is there an option I can set to turn off this behavior?


Solution

  • In [35]: import bs4 as bs
    
    In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
    Out[36]: <h1>FOO</h1>
    

    This parses the HTML with Python's built-in HTML parser instead of html5lib. Quoting the docs:

    Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.


    Alternatively, you could keep the html5lib parser and select the first element inside <body>:

    In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
    
    In [62]: soup.body.next
    Out[62]: <h1>FOO</h1>
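
Note that soup.body.next only returns the first node inside <body>. If the fragment has several top-level elements, a minimal sketch along these lines may help. It assumes bs4 and html5lib are both installed, and body_fragment is a hypothetical helper name used only for illustration, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup


def body_fragment(markup):
    """Parse with html5lib, then return only the markup that ended up inside <body>.

    Hypothetical helper for illustration; assumes the html5lib package is installed.
    """
    soup = BeautifulSoup(markup, "html5lib")
    # decode_contents() renders a tag's children without the tag itself.
    return soup.body.decode_contents()


markup = "<h1>FOO</h1><p>bar</p>"

# html.parser leaves the fragment as written: no <html>, <head> or <body>.
plain = BeautifulSoup(markup, "html.parser")
print(str(plain))

# html5lib wraps the fragment, so strip the wrapper afterwards.
print(body_fragment(markup))
```

For this input both print calls should show the fragment unchanged. The html5lib route still applies html5lib's error recovery to the markup, so it is the one to pick if you want browser-like parsing without the added wrapper tags.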