python html parsing lxml html5lib

Obtaining position info when parsing HTML in Python


I'm trying to find a way to parse (potentially malformed) HTML in Python and, if a set of conditions is met, output that piece of the document with its position (line, column). The position information is what is tripping me up here. To be clear, I have no need to build an object tree. I simply want to find certain pieces of data and their position in the original document (think of a spell checker, for example: 'word "foo" at line x, column y, is misspelled').

As an example I want something like this (using ElementTree's Target API):

import xml.etree.ElementTree as ET

class EchoTarget:
    # NOTE: getpos() is the method I wish existed; it is not part of the Target API.
    # somecondition() is a placeholder for whatever test selects the events of interest.
    def start(self, tag, attrib):
        if somecondition():
            print("start", tag, attrib, self.getpos())
    def end(self, tag):
        if somecondition():
            print("end", tag, self.getpos())
    def data(self, data):
        if somecondition():
            print("data", repr(data), self.getpos())

target = EchoTarget()
parser = ET.XMLParser(target=target)
parser.feed("<p>some text</p>")
parser.close()

However, as far as I can tell, the getpos() method (or something like it) doesn't exist. And, of course, that is using an XML parser. I want to parse potentially malformed HTML.

Interestingly, the HTMLParser class in the Python standard library does offer support for obtaining the location info (with a getpos() method), but it is horrible at handling malformed HTML and has been eliminated as a possible solution. I need to parse HTML that exists in the real world without breaking the parser.

I'm aware of two HTML parsers that would work well at parsing malformed HTML, namely lxml and html5lib. And in fact, I would prefer to use either one of them over any other options available in Python.

However, as far as I can tell, html5lib offers no event API and would require that the document be parsed to a tree object. Then I would have to iterate through the tree. Of course, by that point, there is no association with the source document and all location information is lost. So, html5lib is out, which is a shame because it seems like the best parser for handling malformed HTML.

The lxml library offers a Target API which mostly mirrors ElementTree's, but again, I'm not aware of any way to access location information for each event. A glance at the source code offered no hints either.

lxml also offers an API to SAX events. Interestingly, Python's standard lib mentions that SAX has support for Locator Objects, but offers little documentation about how to use them. This SO Question provides some info (when using a SAX Parser), but I don't see how that relates to the limited support for SAX events that lxml provides.
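For reference, here is a minimal sketch of how Locator objects work with the standard library's xml.sax. It uses the stdlib XML parser, not lxml, so it only illustrates the mechanism and does nothing for malformed HTML. The class name PositionHandler and its events list are made up for this illustration.

```python
import xml.sax


class PositionHandler(xml.sax.ContentHandler):
    """Record each SAX event together with the locator's current position."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.events = []

    def setDocumentLocator(self, locator):
        # The parser hands over a Locator before it reports any other event.
        self.locator = locator

    def _pos(self):
        # Ask the locator where the parser currently is in the source.
        return self.locator.getLineNumber(), self.locator.getColumnNumber()

    def startElement(self, name, attrs):
        self.events.append(("start", name) + self._pos())

    def endElement(self, name):
        self.events.append(("end", name) + self._pos())

    def characters(self, content):
        self.events.append(("data", content) + self._pos())


handler = PositionHandler()
xml.sax.parseString(b"<p>some text</p>", handler)
for event in handler.events:
    print(event)
```

Each printed tuple ends with the line and column the locator reported while that event was being delivered. Whether lxml's SAX support exposes an equivalent locator is exactly the open question here.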

Finally, before anyone suggests Beautiful Soup, I will point out that, as stated on the home page, "Beautiful Soup sits on top of popular Python parsers like lxml and html5lib". All it gives me is an object to extract data from with no connection to the original source document. Like with html5lib, all location info is lost by the time I have access to the data. I want/need raw access to the parser directly.

To expand on the spell checker example I mentioned at the beginning, I would want to check the spelling only of words in the document text (not tag names or attributes) and may want to skip the content of specific tags (like the script or code tags). Therefore, I need a real HTML parser. However, I am only interested in the position of the misspelled words in the original source document when it comes to reporting them, and I have no need to build a tree object. To be clear, this is only an example of one potential use. I may use it for something completely different, but the needs would be essentially the same. In fact, I once built something very similar using HTMLParser, but never used it because the error handling wasn't going to work for that use case. That was years ago, and I seem to have lost that file somewhere along the line. I'd like to use lxml or html5lib instead this time around.

So, is there something I'm missing? I have a hard time believing that none of these parsers (aside from the mostly useless HTMLParser) have any way to access the position information. But if they do it must be undocumented, which seems strange to me.


Solution

  • After some additional research and a more careful review of the html5lib source code, I discovered that html5lib.tokenizer.HTMLTokenizer does retain partial position information. By "partial," I mean that it knows the line and column of the last character of a given token. Unfortunately, it does not retain the position of the start of the token. I suppose the start could be extrapolated, but that feels like re-implementing much of the tokenizer in reverse. And no, using the end position of the previous token won't work if there is white space between tokens.

    In any event, I was able to wrap the HTMLTokenizer and create an HTMLParser clone which mostly replicates the API. You can find my work here: https://gist.github.com/waylan/7d5b7552078f1abc6fac.

    However, as the tokenizer is only part of the parsing process implemented by html5lib, we lose the good parts of html5lib. For example, no normalization has been done at that stage in the process, so you get the raw (potentially invalid) tokens rather than a normalized document. As stated in the comments there, it is not perfect and I question whether it is even useful.

    In fact, I also discovered that the HTMLParser included in the Python standard library was updated for Python 3.3 and no longer crashes hard on invalid input. As far as I can tell, it is better (for my use case) in that it provides genuinely useful position info (as it always has). In all other respects, it is no better or worse than my wrapper of html5lib (except, of course, that it has presumably received much more testing and is therefore more stable). Unfortunately, the update has not been back-ported to Python 2 or earlier Python 3 versions, although I don't imagine that would be all that difficult to do myself.

    In any event, I've decided to move forward with HTMLParser in the standard library and reject my own wrapper around html5lib. You can see an early effort here which appears to work fine with minimal testing.


    According to the Beautiful Soup docs, HTMLParser was updated to support invalid input in Python 2.7.3 and 3.2.2, which is earlier than 3.3.
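As a rough illustration of the HTMLParser approach settled on above (this is not the code linked in the answer), here is a minimal Python 3 sketch of the spell checker use case. KNOWN_WORDS, SKIP_TAGS, WORD_RE, and SpellCheckParser are hypothetical names invented for this example, and the tiny word set stands in for a real dictionary. It assumes that getpos() inside handle_data() reports where the text chunk starts, and it passes convert_charrefs=False so each chunk matches the source text character for character.

```python
import re
from html.parser import HTMLParser

KNOWN_WORDS = {"some", "more"}            # hypothetical stand-in for a dictionary
SKIP_TAGS = {"script", "style", "code"}   # content of these tags is not checked
WORD_RE = re.compile(r"[A-Za-z']+")


class SpellCheckParser(HTMLParser):
    def __init__(self):
        # Keep entity references out of text chunks so that offsets inside
        # a chunk map directly onto columns in the source document.
        super().__init__(convert_charrefs=False)
        self.skip_depth = 0
        self.misspelled = []   # (word, line, column) tuples

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags in malformed input.
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth:
            return
        line, col = self.getpos()   # 1-based line, 0-based column of the chunk
        for match in WORD_RE.finditer(data):
            word = match.group()
            if word.lower() in KNOWN_WORDS:
                continue
            before = data[:match.start()]
            newlines = before.count("\n")
            if newlines:
                # The word sits on a later line than the start of the chunk.
                word_line = line + newlines
                word_col = len(before) - (before.rfind("\n") + 1)
            else:
                word_line, word_col = line, col + match.start()
            self.misspelled.append((word, word_line, word_col))


source = "<p>some txet</p>\n<script>var txet;</script>\n<p>more\n  wrods</p>"
parser = SpellCheckParser()
parser.feed(source)
parser.close()
for word, line, column in parser.misspelled:
    print('word "%s" at line %d, column %d, is misspelled' % (word, line, column))
```

With this input the sketch reports "txet" at line 1, column 8 and "wrods" at line 4, column 2, and it ignores the "txet" inside the script element. Lines are 1-based and columns 0-based, following getpos().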