python compression bzip

Parsing large compressed XML files in Python


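import xml.parsers.expat
from bz2 import BZ2File

file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file)  # fails: Parse() expects a string, not a file object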

Here's code that tries to parse an XML file compressed with bz2. Unfortunately, it fails with this message:

TypeError: Parse() argument 1 must be string or read-only buffer, not bz2.BZ2File

Is there a way to parse bz2-compressed XML files on the fly?

Note: p.Parse(file.read()) is not an option here. I want to parse a file which is larger than available memory, so I need to have a stream.


Solution

  • Just use p.ParseFile(file) instead of p.Parse(file).

    Parse() takes a string, whereas ParseFile() takes an open file object and reads the data from it in chunks as required, so the whole document never has to be in memory at once.

    Ref: http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.ParseFile
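
To make the answer concrete, here is a minimal sketch of streaming a bz2-compressed XML file through expat with ParseFile(). It assumes Python 3; the helper names count_elements and make_sample, the sample document, and the choice to count start tags are all made up for illustration and are not part of the original question.

```python
import bz2
import os
import tempfile
import xml.parsers.expat


def count_elements(path):
    """Stream-parse a bz2-compressed XML file and count start tags by name."""
    counts = {}

    def on_start(name, attrs):
        # Called by expat for every opening tag as the data streams in.
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    # BZ2File decompresses lazily; ParseFile() pulls data via read(nbytes).
    with bz2.BZ2File(path) as f:
        parser.ParseFile(f)
    return counts


def make_sample(path, n=1000):
    """Write a small compressed XML document (hypothetical test data)."""
    with bz2.BZ2File(path, "w") as f:
        f.write(b"<root>")
        for i in range(n):
            f.write(b"<item id='%d'/>" % i)
        f.write(b"</root>")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.xml.bz2")
        make_sample(sample)
        print(count_elements(sample))
```

Because the work happens in handler callbacks, memory use depends on what the handlers keep around rather than on the size of the file.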