python lxml.html

How to use Cleaner from lxml.html without getting a wrapping div tag?


I have this code:

from lxml.html.clean import Cleaner

evil = "<script>malignus script</script><b>bold text</b><i>italic text</i>"
# Whitelist only <p>, <br> and <b>.
cleaner = Cleaner(remove_unknown_tags=False, allow_tags=['p', 'br', 'b'],
                  page_structure=True)
print(cleaner.clean_html(evil))

I expected to get this:

<b>bold text</b>italic text

But instead I'm getting this:

<div><b>bold text</b>italic text</div>

Is there an attribute to remove the div tag wrapper?


Solution

  • lxml expects your HTML to have a tree structure, i.e. a single root node. If the input does not have one, lxml adds a wrapper element (here, the div).