python xml sax

Python3 SAX - parsing internal entity names instead of resolving them


I want to parse an XML document (the JMdict dictionary). It is a list of entries whose child elements describe properties of the entry, for example <pos>&vr;</pos>. The internal entity vr is declared at the beginning of the XML document as <!ENTITY vr "irregular ru verb, plain form ends with -ri">, so the entity value is the human-readable form.

My current code, written in Python 3.11 with the xml.sax package, retrieves the property values by implementing the characters() method of ContentHandler. Because the parser resolves the entities, what I get back is the human-readable form.
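To make the behaviour concrete, here is a minimal sketch of that kind of handler. The SAMPLE document is a made-up miniature of JMdict, and the PosHandler class is a hypothetical name for illustration, not my actual code.

```python
import xml.sax

# Hypothetical miniature of a JMdict-style document (not the real file).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE JMdict [
<!ENTITY vr "irregular ru verb, plain form ends with -ri">
]>
<JMdict><entry><pos>&vr;</pos></entry></JMdict>"""


class PosHandler(xml.sax.ContentHandler):
    """Collects the text content of every <pos> element."""

    def __init__(self):
        super().__init__()
        self.in_pos = False
        self.buffer = []
        self.values = []

    def startElement(self, name, attrs):
        if name == "pos":
            self.in_pos = True
            self.buffer = []

    def characters(self, content):
        # characters() may be called several times for one text node,
        # so the pieces are buffered and joined in endElement().
        if self.in_pos:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "pos":
            self.values.append("".join(self.buffer))
            self.in_pos = False


handler = PosHandler()
xml.sax.parseString(SAMPLE, handler)
print(handler.values)
# ['irregular ru verb, plain form ends with -ri']
```

The handler only ever sees the replacement text. The name vr is gone by the time characters() is called.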

If, for example, I now want my pipeline to filter out all entries that are irregular ru verbs, I cannot check that they contain &vr;. Instead I have to check that they contain irregular ru verb, plain form ends with -ri, which is cumbersome and is not guaranteed to stay correct across different versions of the dictionary.

How can I retrieve the entity name instead of the entity value?

Because I couldn't find out how to disable entity resolution, I tried handling the events described by the EntityResolver and DTDHandler interfaces to see where that would lead me, but they were never called. According to the answers to this post, they are called only for external entities.


Solution

  • lxml.etree.iterparse has this feature built in: pass resolve_entities=False.

    I rewrote my parser with lxml.
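A rough sketch of what this can look like, assuming lxml is installed. The SAMPLE document is again a made-up miniature of JMdict, and pos_entity_names is a hypothetical helper name. With resolve_entities=False, the unresolved reference appears as an entity node among the children of <pos>, and its name attribute holds the entity name.

```python
import io

from lxml import etree

# Hypothetical miniature of a JMdict-style document (not the real file).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE JMdict [
<!ENTITY vr "irregular ru verb, plain form ends with -ri">
<!ENTITY n "noun (common) (futsuumeishi)">
]>
<JMdict>
<entry><pos>&vr;</pos></entry>
<entry><pos>&n;</pos></entry>
</JMdict>"""


def pos_entity_names(source):
    """Yield, for each <entry>, the entity names referenced inside its <pos> elements."""
    for _event, entry in etree.iterparse(
        source, events=("end",), tag="entry", resolve_entities=False
    ):
        names = [
            node.name
            for pos in entry.iter("pos")
            for node in pos  # children include entity reference nodes
            if node.tag is etree.Entity
        ]
        yield names
        entry.clear()  # release the subtree once the entry has been processed


result = list(pos_entity_names(io.BytesIO(SAMPLE)))
print(result)
```

Filtering out irregular ru verbs then becomes a check for "vr" in the yielded list, which does not depend on the wording of the entity value.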