
Iteratively parsing HTML (with lxml?)


I'm currently trying to iteratively parse a very large HTML document (I know.. yuck) using lxml.etree.iterparse:

Incremental parser. Parses XML into a tree and generates tuples (event, element) in a SAX-like fashion

I am using an incremental (SAX-style) approach to keep memory usage down: the file is too large to load into a DOM/tree.

The problem I'm having is that I'm getting XML syntax errors such as:

lxml.etree.XMLSyntaxError: Attribute name redefined, line 134, column 59

This then causes everything to stop.

Is there a way to iteratively parse HTML without choking on syntax errors?

At the moment I'm extracting the line number from the XML syntax error exception, removing that line from the document, and then restarting the process. Seems like a pretty disgusting solution. Is there a better way?

Edit:

This is what I'm currently doing:

import re
from lxml import etree

context = etree.iterparse(tfile, events=('start', 'end'), html=True)
in_table = False
header_row = True
while True:
    try:
        event, el = next(context)

        # do something

        # remove old elements
        while el.getprevious() is not None:
            del el.getparent()[0]

    except StopIteration:
        break  # reached the end of the document
    except etree.XMLSyntaxError as e:
        print(e.msg)
        # pull the offending line number out of the error message
        lineno = int(re.search(r'line (\d+),', e.msg).group(1))
        remove_line(tfilename, lineno)
        # start over on the patched file
        tfile.close()
        tfile = open(tfilename, 'rb')
        context = etree.iterparse(tfile, events=('start', 'end'), html=True)
    except KeyError:
        print('oops keyerror')
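
The remove_line helper called in the loop is not shown. A minimal sketch of what such a helper might look like, assuming the line numbers reported by lxml are 1-based and that rewriting the file through a temporary copy is acceptable (the name and signature are simply taken from the call above; the body is a guess):

```python
import os
import tempfile


def remove_line(path, lineno):
    """Rewrite the file at `path` without its `lineno`-th line (1-based)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write the copy next to the original so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as dst, open(path, 'rb') as src:
            # Stream line by line so the whole file is never held in memory.
            for current, line in enumerate(src, start=1):
                if current != lineno:
                    dst.write(line)
        os.replace(tmp_path, path)
    except Exception:
        os.remove(tmp_path)
        raise
```

Every syntax error costs a full copy of the file plus a restart of the parse, which is why this approach gets slow on a large, messy document.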

Solution

  • The perfect solution ended up being Python's very own HTMLParser (the HTMLParser module in Python 2, html.parser in Python 3).

    This is the (pretty bad) code I ended up using:

    from html.parser import HTMLParser  # Python 3; the module was named HTMLParser in Python 2

    class MyParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.finished = False
            self.in_table = False
            self.in_row = False
            self.in_cell = False
            self.current_row = []
            self.current_cell = ''

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if not self.in_table:
                # only the table with id="dgResult" is of interest
                if tag == 'table':
                    if ('id' in attrs) and (attrs['id'] == 'dgResult'):
                        self.in_table = True
            else:
                if tag == 'tr':
                    self.in_row = True
                elif tag == 'td':
                    self.in_cell = True
                elif (tag == 'a') and (len(self.current_row) == 7):
                    # the 8th cell holds a link: keep its URL instead of its text
                    url = attrs.get('href') or ''
                    self.current_cell = url

        def handle_endtag(self, tag):
            if tag == 'tr':
                if self.in_table:
                    if self.in_row:
                        self.in_row = False
                        print(self.current_row)
                        self.current_row = []
            elif tag == 'td':
                if self.in_table:
                    if self.in_cell:
                        self.in_cell = False
                        self.current_row.append(self.current_cell.strip())
                        self.current_cell = ''

            elif (tag == 'table') and self.in_table:
                self.finished = True

        def handle_data(self, data):
            # text may arrive in several pieces, so accumulate it per cell
            if len(self.current_row) != 7:
                if self.in_cell:
                    self.current_cell += data

    With that code I could then do this:

    parser = MyParser()
    for line in myfile:
        parser.feed(line)
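
Feeding line by line works, but the parser buffers incomplete tags between feed() calls, so fixed-size chunks work as well, and the finished flag can be used to stop reading once the table has closed. A rough sketch of that idea follows. RowCollector is a hypothetical, cut-down variant of MyParser that stores rows in a list instead of printing them and ignores the link column; feed_in_chunks and its chunk_size parameter are likewise made up for illustration:

```python
import io
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Hypothetical, simplified variant of MyParser: collects rows in a list."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.finished = False
        self.rows = []
        self.current_row = []
        self.current_cell = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'table' and dict(attrs).get('id') == self.table_id:
            self.in_table = True
        elif self.in_table and tag == 'td':
            self.in_cell = True

    def handle_endtag(self, tag):
        if not self.in_table:
            return
        if tag == 'td' and self.in_cell:
            self.in_cell = False
            self.current_row.append(self.current_cell.strip())
            self.current_cell = ''
        elif tag == 'tr':
            self.rows.append(self.current_row)
            self.current_row = []
        elif tag == 'table':
            self.in_table = False
            self.finished = True

    def handle_data(self, data):
        # Text can arrive in several pieces when a chunk boundary splits it.
        if self.in_cell:
            self.current_cell += data


def feed_in_chunks(parser, fileobj, chunk_size=64 * 1024):
    """Feed fixed-size chunks; stop as soon as the parser reports it is done."""
    while not parser.finished:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            parser.close()  # end of input: flush whatever is still buffered
            break
        parser.feed(chunk)


# Deliberately malformed input: a duplicated attribute, like the one that
# triggered "Attribute name redefined" in the question.
html = ('<table id="dgResult"><tr><td class="a" class="b">x</td>'
        '<td>y</td></tr></table><p>ignored</p>')
collector = RowCollector('dgResult')
feed_in_chunks(collector, io.StringIO(html), chunk_size=16)
print(collector.rows)  # prints [['x', 'y']]
```

The tiny chunk_size here is only to show that tags split across chunks are handled; for a real file something in the tens of kilobytes is a more sensible choice.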