I have an XHTML file that is structured like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html>
I'm using BeautifulSoup and I want to remove the XML declaration from the document, so that the result looks like this:
<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html>
I can't find a way to get at the XML declaration to remove it. It doesn't appear to be a Doctype, Declaration, Tag, or NavigableString as far as I can tell. Is there a way I can find this to extract it?
As a working example, I can remove the Doctype with code like this (assuming the document text is the variable "html"):
from bs4 import BeautifulSoup, Doctype
soup = BeautifulSoup(html, 'html.parser')
for item in [i for i in soup.contents if isinstance(i, Doctype)]:
    item.extract()
With the html.parser backend, the XML declaration is parsed as a ProcessingInstruction, so you could use the following approach:
import bs4

soup = bs4.BeautifulSoup(html, 'html.parser')

# Look through the top-level nodes for the XML declaration
for e in soup:
    if isinstance(e, bs4.element.ProcessingInstruction):
        e.extract()
        break  # stop right after removing it

print(soup)
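For your sample, this would print the updated HTML below (the trailing <html></html></html> appears because the sample ends with <html> rather than </html>):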
<!DOCTYPE html>
<html lang="en">
<head>
...
</head>
<body>
...
</body>
<html></html></html>
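If you want to check for yourself which node type the declaration ends up as, or reuse the removal in more than one place, something like the following sketch may help. The helper names top_level_types and strip_processing_instructions are made up for illustration and are not part of BeautifulSoup, and the sample string is a shortened, well-formed stand-in for your document. It assumes the html.parser backend; other parsers may represent the declaration differently.

```python
import bs4
from bs4.element import ProcessingInstruction


def top_level_types(markup: str) -> list:
    """Return the class name of each top-level node (hypothetical helper)."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    return [type(node).__name__ for node in soup.contents]


def strip_processing_instructions(markup: str) -> str:
    """Remove top-level processing instructions (hypothetical helper)."""
    soup = bs4.BeautifulSoup(markup, "html.parser")
    # Collect matches first so the tree is not modified while iterating.
    targets = [n for n in soup.contents if isinstance(n, ProcessingInstruction)]
    for node in targets:
        node.extract()
    # Drop the whitespace left behind where the declaration used to be.
    return str(soup).lstrip()


# Shortened stand-in for the document in the question (an assumption).
sample = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<!DOCTYPE html>\n"
    '<html lang="en"><head></head><body></body></html>'
)

print(top_level_types(sample))
print(strip_processing_instructions(sample))
```

Printing the type names first is a quick way to see why checks against Doctype, Declaration, Tag, or plain NavigableString do not single out the declaration.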