I am removing suspicious comments from HTML mails with bs4. I have now run into a problem with so-called conditional comments of type downlevel-revealed.
import bs4
html = 'A<!--[if expression]>a<![endif]-->' \
'B<![if expression]>b<![endif]>'
soup = bs4.BeautifulSoup(html, 'html5lib')
# find_all returns a list, so extracting nodes while looping over it is safe
for comment in soup.find_all(text=lambda text: isinstance(text, bs4.Comment)):
    comment.extract()
Text nodes before extraction:
['A', '[if expression]>a<![endif]', 'B', '[if expression]', 'b', '[endif]']
Text nodes after extraction:
['A', 'B', 'b']
The lowercase b should also be removed. The problem is that bs4 parses the first comment as one single Comment object, but the second one as three objects: Comment(if), NavigableString(b) and Comment(endif). Extraction removes only the two Comment objects. The NavigableString with content 'b' remains in the DOM.
Any solution to this?
After some time reading about conditional comments, I understand why it happens this way.
Downlevel-hidden conditional comments are written as normal comments, <!-- ... -->. Modern browsers parse the whole block, including its content, as one ordinary comment. So BeautifulSoup removes it completely when I remove comments.
Downlevel-revealed conditional comments are written as <!...>b<!...>. Modern browsers treat the two tags as invalid and ignore them in the DOM, so only b remains as valid content. BeautifulSoup therefore removes only the tags, not the content between them.
BeautifulSoup handles conditional comments the same way modern browsers do. This is perfectly fine for my circumstances.
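If someone does need the content of downlevel-revealed blocks removed as well, one possible approach is to track the opening and closing markers while walking the string nodes. The following is only a sketch under assumptions: `strip_conditional_comments` is a hypothetical helper name, not part of bs4, it relies on html5lib turning `<![if ...]>` and `<![endif]>` into Comment nodes as seen above, and it removes only text nodes, so tags inside a revealed block would be left behind empty.

```python
import bs4


def strip_conditional_comments(html, parser='html5lib'):
    """Remove all comments plus text inside downlevel-revealed blocks.

    Hypothetical helper for illustration; handles text nodes only.
    """
    soup = bs4.BeautifulSoup(html, parser)
    inside = False   # True between an '[if ...]' and an '[endif]' comment
    doomed = []      # nodes to remove once the walk is finished

    for node in soup.find_all(string=True):
        if isinstance(node, bs4.Comment):
            marker = node.strip().lower()
            # A downlevel-hidden comment contains its own 'endif',
            # so it must not open a revealed block.
            if marker.startswith('[if') and 'endif' not in marker:
                inside = True
            elif marker == '[endif]':
                inside = False
            doomed.append(node)
        elif inside:
            doomed.append(node)

    for node in doomed:
        node.extract()
    return soup


html = ('A<!--[if expression]>a<![endif]-->'
        'B<![if expression]>b<![endif]>')
cleaned = strip_conditional_comments(html)
print(cleaned.get_text())
```

Collecting the nodes first and extracting them afterwards keeps the walk independent of the tree changes. For the example HTML this should leave only A and B as text.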