html html-parsing lxml html5lib

lxml html5parser ignores "namespaceHTMLElements=False" option


The lxml html5parser seems to ignore the namespaceHTMLElements=False option I pass to it. It puts every element into the HTML namespace instead of the expected null namespace (that is, no namespace at all).

Here’s a simple case that reproduces the problem:

echo "<p>" | python -c "from sys import stdin; \
  from lxml.html import html5parser as h5, tostring; \
  print tostring(h5.parse(stdin, h5.HTMLParser(namespaceHTMLElements=False)))"

The output from that is this:

<html:html xmlns:html="http://www.w3.org/1999/xhtml"><html:head></html:head><html:body><html:p>
</html:p></html:body></html:html>

As can be seen, the html element and all other elements there are in the HTML namespace.
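To make concrete what "in the HTML namespace" means for code that consumes the tree, here is a minimal sketch. It assumes only the standard library's xml.etree.ElementTree, used purely for illustration, so it is not the lxml code path in question. It re-parses the serialized output shown above and checks how the tags look and which path expressions match.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialized output copied from the reproduction above.
SERIALIZED = (
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml">'
    '<html:head></html:head><html:body><html:p>\n'
    '</html:p></html:body></html:html>'
)

root = ET.fromstring(SERIALIZED)

# Namespaced elements carry the namespace URI inside the tag name
# (the "{uri}localname" form, often called Clark notation).
print(root.tag)

# A plain, un-namespaced path finds nothing in this tree ...
plain_match = root.find(".//p")

# ... while a namespace-qualified path finds the paragraph.
# The "h" prefix is arbitrary; only the URI it maps to matters.
qualified_match = root.find(".//h:p", {"h": XHTML_NS})

print(plain_match is None, qualified_match is not None)
```

This is why the namespaced result is inconvenient in practice: every path expression has to be namespace-qualified, whereas the expected un-namespaced tree could be queried with plain element names.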

The expected output is instead this:

<html><head></head><body><p>
</p></body></html>

I recognize that namespaceHTMLElements is an html5lib option, not a native lxml option that lxml acts on itself. lxml is supposed to just pass that option through to html5lib so that html5lib applies it as expected.


Update 2016-02-17

I still haven’t found a way to get the lxml html5parser to honor the namespaceHTMLElements option. But to be clear, the alternative is to call html5lib directly instead, like this:

echo "<p>" | python -c "from sys import stdin; \
import html5lib; from lxml import html; \
doc = html5lib.parse(stdin, treebuilder='lxml', namespaceHTMLElements=False); \
print html.tostring(doc)"
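If switching parsers is not practical, another option is to strip the namespace from an already-parsed tree. This is a different technique from the one above, not something lxml or html5lib does for you. The sketch below uses only the standard library's xml.etree.ElementTree. The helper name strip_namespace is made up for illustration. lxml elements also support iter() and tag assignment, but I have not verified this against the html5parser output.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def strip_namespace(root, namespace=XHTML_NS):
    """Rewrite tags in `namespace` to their bare local names, in place.

    Hypothetical helper for illustration; not part of lxml or html5lib.
    """
    prefix = "{%s}" % namespace
    for element in root.iter():
        # Some tree implementations use non-string tags for comments
        # and processing instructions, so skip anything that is not a str.
        if isinstance(element.tag, str) and element.tag.startswith(prefix):
            element.tag = element.tag[len(prefix):]
    return root


# Stand-in for a tree that came back with namespaced elements.
namespaced = ET.fromstring(
    '<html:html xmlns:html="http://www.w3.org/1999/xhtml">'
    '<html:head></html:head><html:body><html:p>\n'
    '</html:p></html:body></html:html>'
)

stripped = strip_namespace(namespaced)
result = ET.tostring(stripped, encoding="unicode", method="html")
print(result)
```

After the rewrite, plain lookups such as stripped.find(".//p") match again. With an lxml tree, an unused namespace declaration may remain on the root element after serialization, so treat this as a starting point rather than a drop-in fix.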

More details

Some things I already know:


Conclusion about where the cause is to be found

Given the above, it’s clear that the problem is in the interface between lxml and html5lib. I’m not sure why lxml is calling into html5lib twice, but I think it may be because it first tries to create a new instance of its own XHTMLParser before doing what I’m actually asking it to do, which is just to create an instance of its own HTMLParser.

So maybe the fact that it does make two calls to html5lib causes html5lib to sort of “lock in” the default namespaceHTMLElements=True behavior that results from the first call, and then ignore the namespaceHTMLElements=False directive when it sees it in the second call.

Maybe in making two calls the way it does, lxml is either breaking some assumption in html5lib, or is actually misusing the html5lib API in a way that it by design is not intended to be used.

Or maybe the cause isn’t at all the result of lxml making two separate calls to html5lib, but instead some other problem in the way it’s using the html5lib interface.

Anyway, I’m interested in hearing whether anybody else has run into this problem and has a workaround, or at least has some insight into why it’s happening.


Solution

  • I followed in the source code how lxml hands parameters to html5lib. Most of the functions end with a trailing **kws argument, which is then handed on to the next function. In one of the last steps, when the actual html5 parser is called, the **kws is dropped and the parser is called with 2 fixed params.

    (I had the same problem yesterday and only just got to this question, so I have forgotten the fine details. Allow me to forgo code snippets and references.)

    Anyway, this confirms that in 2018, calling html5lib directly is still the preferred way, if calling lxml's own parser is not an option for some reason.

    (My use case was: parse crappy HTML and query it with XPath.)
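The failure pattern described in this answer, keyword arguments accepted along the chain and then dropped at the last hop, can be sketched in isolation. Everything below is hypothetical. BackendParser, namespace_elements and both factory functions are invented names, and this is not lxml's or html5lib's actual code. It only shows how an option can vanish silently when a wrapper calls the backend with a fixed argument list.

```python
class BackendParser:
    """Hypothetical stand-in for the underlying parser (not html5lib's real class)."""

    def __init__(self, strict=False, namespace_elements=True):
        self.strict = strict
        self.namespace_elements = namespace_elements


def make_parser_dropping(strict=False, **kwargs):
    # Bug pattern: extra keyword arguments are accepted but never forwarded,
    # so the backend's default silently wins and no error is raised.
    return BackendParser(strict=strict)


def make_parser_forwarding(strict=False, **kwargs):
    # Forwarding variant: whatever the caller passed reaches the backend.
    return BackendParser(strict=strict, **kwargs)


dropped = make_parser_dropping(namespace_elements=False)
forwarded = make_parser_forwarding(namespace_elements=False)

print(dropped.namespace_elements)    # True: the caller's option was lost
print(forwarded.namespace_elements)  # False: the caller's option was honored
```

Because the dropping variant raises no error, the only symptom is that the option appears to be ignored, which matches what the question observes.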