python beautifulsoup

Why does BeautifulSoup output self-closing tags in HTML?


I've tried with 3 different parsers: lxml, html5lib, html.parser

All of them output invalid HTML:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup('<br>', 'html.parser')
<br/>
>>> BeautifulSoup('<br>', 'lxml')
<html><body><br/></body></html>
>>> BeautifulSoup('<br>', 'html5lib')
<html><head></head><body><br/></body></html>
>>> BeautifulSoup('<br>', 'html.parser').prettify()
'<br/>\n'

In every case the void tag is serialized in "self-closing" form, with a trailing />.

How can I get BeautifulSoup to output HTML that has void tags without />?


Solution

  • Use the html5 formatter:

    If you pass in formatter="html5", it’s the same as formatter="html", but Beautiful Soup will omit the closing slash in HTML void tags like “br”:

    from bs4 import BeautifulSoup
    
    BeautifulSoup('<br>', 'html.parser').decode(formatter="html5")
    

    Which outputs:

    '<br>'
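
    The same formatter argument should also work on a larger document and with prettify(). A minimal sketch, assuming a Beautiful Soup release recent enough to include the html5 formatter; the sample markup and variable names are only for illustration:

```python
from bs4 import BeautifulSoup

# Sample markup (made up for this example) with two void tags: <br> and <hr>.
soup = BeautifulSoup('<p>one<br>two</p><hr>', 'html.parser')

# Default formatter: void tags are written with a closing slash.
default_html = soup.decode()

# html5 formatter: void tags are written without the closing slash.
html5_html = soup.decode(formatter="html5")
print(html5_html)  # <p>one<br>two</p><hr>

# prettify() takes the same formatter argument.
pretty_html = soup.prettify(formatter="html5")
print(pretty_html)
```

    Note that the formatter is chosen at output time, so it has to be passed on each decode(), encode() or prettify() call; str(soup) keeps using the default formatter.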