python xml parsing sax

How can I stop SAX parsing?


I am using a SAX parser (xml.sax) and it works the way I want it to. However, I am parsing quite a large file (which is why I use SAX), and I would like to stop parsing at some point (e.g., when I have reached a certain limit or found a certain piece of data).

import xml.sax

class ProductHandler(xml.sax.ContentHandler):
  def startElement(self, tag, attrs):
    pass  # [.. process start ..]

  def endElement(self, tag):
    pass  # [.. process end ..]

  def characters(self, content):
    pass  # [.. process characters ..]

product_handler = ProductHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(product_handler)
parser.parse(xmlfile)  # xmlfile is defined elsewhere (the large input file)

How do I do that? Is there a particular value I can return from one of the handler methods? I checked the documentation, but I couldn't find anything along these lines.


Solution

  • One way to stop early is to raise a custom exception from a handler method and catch it around parser.parse(). Using this example data, to find a <description> that contains the word "sourdough", we could write something like this:

    import xml.sax
    
    
    class IAmAllDone(Exception):
        pass
    
    
    class ProductHandler(xml.sax.handler.ContentHandler):
        def __init__(self):
            super().__init__()
    
            self.description = None
            self.name = None
            self.tree = []
    
        def startElement(self, name, attrs):
            self.tree.append(name)
    
        def endElement(self, name):
            self.tree.pop()  # drop the element that just closed (the last one pushed)
    
        def characters(self, content):
            if self.tree[-1] == "name" and content.strip():
                self.name = content
                print("name:", content)
            elif self.tree[-1] == "description" and "sourdough" in content:
                self.description = content
                raise IAmAllDone()
    
    
    product_handler = ProductHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(product_handler)
    try:
        parser.parse("data.xml")
    except IAmAllDone:
        pass
    
    if product_handler.description is not None:
        print("found description:", product_handler.description)
    

    The above will output:

    name: Belgian Waffles
    name: Strawberry Belgian Waffles
    name: Berry-Berry Belgian Waffles
    name: French Toast
    found description: Thick slices made from our homemade sourdough bread
    

    As you can see, we stop the SAX parsing before reading the final "Homestyle Breakfast" item.
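
The same idea also covers the "stop after a certain limit" case from the question. Below is a minimal, self-contained sketch under a few assumptions: the names StopParsing, LimitedNameHandler, first_names and the inline SAMPLE document are made up for illustration and are not part of the original example data. It also buffers text between startElement and endElement, because characters() may deliver a single text node in several chunks, so checking each chunk on its own could miss a match.

```python
import xml.sax


class StopParsing(Exception):
    """Raised by the handler to abort parsing early (hypothetical name)."""


class LimitedNameHandler(xml.sax.handler.ContentHandler):
    """Collects the text of <name> elements and stops after `limit` of them."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit
        self.names = []
        self._buffer = None  # list of text chunks while inside <name>

    def startElement(self, name, attrs):
        if name == "name":
            self._buffer = []

    def characters(self, content):
        # One text node may arrive in several calls, so collect the pieces.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "name" and self._buffer is not None:
            self.names.append("".join(self._buffer).strip())
            self._buffer = None
            if len(self.names) >= self.limit:
                raise StopParsing()  # limit reached: abort the parse


# Small inline document standing in for the real (large) file.
SAMPLE = b"""<menu>
  <food><name>Belgian Waffles</name></food>
  <food><name>French Toast</name></food>
  <food><name>Homestyle Breakfast</name></food>
</menu>"""


def first_names(xml_bytes, limit):
    """Return at most `limit` names, stopping the parser once they are found."""
    handler = LimitedNameHandler(limit)
    try:
        xml.sax.parseString(xml_bytes, handler)
    except StopParsing:
        pass  # expected: the handler asked to stop
    return handler.names


print(first_names(SAMPLE, 2))  # ['Belgian Waffles', 'French Toast']
```

Using a dedicated exception class, rather than a generic one, keeps the except clause from swallowing real parse errors such as xml.sax.SAXParseException.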