python django html5lib

Remove contents of <style>...</style> tags using html5lib or bleach


I've been using the excellent bleach library for removing bad HTML.

I've got a load of HTML documents which have been pasted in from Microsoft Word, and contain things like:

<STYLE> st1:*{behavior:url(#ieooui) } </STYLE>

Using bleach (with the style tag implicitly disallowed) leaves me with:

st1:*{behavior:url(#ieooui) }

Which isn't helpful. Bleach seems only to have options to:

  • escape the tags
  • remove the tags (but not their contents)

I'm looking for a third option - remove the tags and their contents.

Is there any way to use bleach or html5lib to completely remove the style tag and its contents? The documentation for html5lib isn't really a great deal of help.


Solution

  • It turned out lxml was a better tool for this task:

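    # Note: since lxml 5.2 the cleaner lives in the separate lxml_html_clean
    # package, which must be installed for this import to work.
    from lxml.html.clean import Cleaner

    def clean_word_text(text):
        # style=True removes <style>...</style> tags together with their
        # contents. Cleaner's other defaults still apply, so scripts and
        # comments are removed as well.
        cleaner = Cleaner(style=True)
        return cleaner.clean_html(text)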
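
If pulling in lxml is not an option, the same "remove the tag and its contents" step can probably be sketched with the standard library alone. This is not part of the answer above: it swaps lxml's Cleaner for `html.parser.HTMLParser`, and the names `StyleStripper` and `strip_style_elements` are made up for illustration. It only drops style elements and copies everything else through, so it does no sanitising of its own and the result would still need to go through bleach.

```python
from html.parser import HTMLParser


class StyleStripper(HTMLParser):
    """Re-emit markup as read, skipping <style> elements and their contents."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.parts = []
        self.in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":  # tag names arrive lowercased, so <STYLE> matches
            self.in_style = True
        elif not self.in_style:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: copy through, never start skipping.
        if tag != "style" and not self.in_style:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False
        elif not self.in_style:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.in_style:
            self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_style_elements(html):
    """Hypothetical helper: return html without <style> blocks."""
    parser = StyleStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

Running `strip_style_elements` on the Word snippet from the question followed by a paragraph should leave only the paragraph. Note that this sketch defines no handlers for comments or doctype declarations, so those are silently dropped too, and an unclosed `<style>` tag swallows the rest of the input.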