Tags: python, windows, python-2.7, anaconda, bz2

Decompressing bz2 files on Windows


I am trying to decompress a bz2 file with the code snippet below, which is suggested in various places:

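import bz2
# Read the whole archive, then write the decompressed bytes; both files are closed on exit.
with bz2.BZ2File(DATA_FILE + ".bz2") as src, open(DATA_FILE, 'wb') as dst:
    dst.write(src.read())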

However, the file I get is much smaller than I expect.

When I extract the archive with the 7z GUI, I get a 248 MB file. With the code above, the file I get is only 879 KB.

When I read the extracted XML file, I can see that the rest of the file is missing, as the size suggests.

I am running Anaconda on a Windows machine, and as far as I understand, bz2 hits an EOF before the file actually ends.

By the way, I have already tried this and this; neither helped.


Solution

  • If this is a multi-stream file, then Python's bz2 module (before 3.3) doesn't support it:

    Note This class does not support input files containing multiple streams (such as those produced by the pbzip2 tool). When reading such an input file, only the first stream will be accessible. If you require support for multi-stream files, consider using the third-party bz2file module (available from PyPI). This module provides a backport of Python 3.3’s BZ2File class, which does support multi-stream files.

    The third-party bz2file module mentioned in that note is a drop-in replacement for bz2.BZ2File and should be able to read the whole file.
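
If installing bz2file is not an option, the same effect can probably be approximated with the standard library alone. The sketch below assumes the archive really is a series of concatenated bz2 streams and that it fits in memory; `decompress_multistream` is a hypothetical helper name, not part of the bz2 module. It relies on `bz2.BZ2Decompressor.unused_data`, which holds whatever bytes follow the end of the stream that was just decoded.

```python
import bz2


def decompress_multistream(data):
    """Decompress bytes that may contain several concatenated bz2 streams.

    Hypothetical helper: a fresh BZ2Decompressor is created for each
    stream, because a single decompressor stops at the first end-of-stream.
    """
    chunks = []
    while data:
        decompressor = bz2.BZ2Decompressor()
        chunks.append(decompressor.decompress(data))
        # Bytes left over after this stream ended belong to the next stream.
        # If the stream did not end, unused_data is empty and the loop stops.
        data = decompressor.unused_data
    return b"".join(chunks)


if __name__ == "__main__":
    # Two independently compressed streams glued together, as pbzip2 would do.
    sample = bz2.compress(b"first ") + bz2.compress(b"second")
    print(decompress_multistream(sample))
```

With the question's DATA_FILE, one would read DATA_FILE + ".bz2" in binary mode ('rb'), pass the bytes to this helper, and write the result to DATA_FILE in 'wb' mode. For a 248 MB output this holds both the compressed and decompressed data in memory at once, so bz2file is likely the more comfortable choice for larger files.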