I'm using BeautifulSoup with html5lib, and it adds the html, head and body tags automatically:
BeautifulSoup('<h1>FOO</h1>', 'html5lib') # => <html><head></head><body><h1>FOO</h1></body></html>
Is there an option I can set to turn off this behavior?
In [35]: import bs4 as bs
In [36]: bs.BeautifulSoup('<h1>FOO</h1>', "html.parser")
Out[36]: <h1>FOO</h1>
This parses the HTML with Python's built-in HTML parser. Quoting the docs:
Unlike html5lib, this parser makes no attempt to create a well-formed HTML document by adding a <body> tag. Unlike lxml, it doesn’t even bother to add an <html> tag.
Alternatively, you could use the html5lib parser and just select the element after <body>:
In [61]: soup = bs.BeautifulSoup('<h1>FOO</h1>', 'html5lib')
In [62]: soup.body.next
Out[62]: <h1>FOO</h1>
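Note that soup.body.next only gives you the first node inside <body>. If the fragment has several top-level elements, a minimal sketch along these lines might help. It assumes html5lib is installed, and parse_fragment is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup


def parse_fragment(markup):
    """Hypothetical helper: parse with html5lib and return the nodes inside <body>."""
    soup = BeautifulSoup(markup, "html5lib")
    # html5lib always builds <html><head></head><body>...</body></html>,
    # so the fragment's top-level nodes are the children of <body>.
    return list(soup.body.contents)


nodes = parse_fragment("<h1>FOO</h1><p>bar</p>")

# Re-join the nodes to get the fragment back without the added wrapper tags.
fragment_html = "".join(str(node) for node in nodes)
print(fragment_html)  # <h1>FOO</h1><p>bar</p>
```

This keeps html5lib's browser-like error recovery while dropping the wrapper it adds. If you only need the markup as a string, soup.body.decode_contents() should give the same result.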