The BZ2 file I'm using is a partial dump of Wikipedia [here]
Here's the Python code I wrote to compare the length of a 10000-byte block before and after decompression:
import bz2
with open('enwiki-20231020-pages-articles-multistream1.xml-p1p41242.bz2', 'rb') as f:
    block = f.read(10000)
    print(len(block))
    block = bz2.BZ2Decompressor().decompress(block)
    print(len(block))
It outputs:
10000
2560
This suggests that the decompressor is somehow shrinking the block. How is this possible? Everything I found says this shouldn't happen.
This is because a bzip2 file may be a concatenation of multiple compressed streams, and bz2.BZ2Decompressor
decompresses only the first stream from the input data.
Excerpt from the documentation of bz2.BZ2Decompressor:
Note: This class does not transparently handle inputs containing multiple compressed streams, unlike decompress() and BZ2File. If you need to decompress a multi-stream input with BZ2Decompressor, you must use a new decompressor for each stream.
In your example, the first stream decompresses to 2560 bytes. The second stream begins in whatever remains of the buffer after the first stream ends. Those leftover bytes are stored in the unused_data attribute of the decompressor instance, and you can decompress them by creating a new bz2.BZ2Decompressor instance, as noted in the documentation.
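The same behaviour can be reproduced without the Wikipedia dump. The following is a minimal sketch that builds a two-stream input in memory; the payloads and the names first, second, d1 and d2 are made up for illustration and are not part of the original question:

```python
import bz2

# Two independent bzip2 streams concatenated back to back (illustrative data).
first = b"a" * 2560
second = b"b" * 5000
data = bz2.compress(first) + bz2.compress(second)

# A single decompressor stops at the end of the first stream.
d1 = bz2.BZ2Decompressor()
out1 = d1.decompress(data)
assert out1 == first
assert d1.eof

# Everything after the first stream is kept, still compressed, in unused_data.
leftover = d1.unused_data

# A fresh decompressor is needed for the second stream.
d2 = bz2.BZ2Decompressor()
out2 = d2.decompress(leftover)
assert out2 == second

# The module-level bz2.decompress() handles both streams transparently.
assert bz2.decompress(data) == first + second
```

Here the eof attribute becomes True as soon as the end of the first stream is reached, which is the signal to switch to a new decompressor.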
You can therefore decompress an entire bzip2 file in 10000-byte chunks by feeding the decompressor either the unused_data of the current instance or, when that is empty, the next chunk of the file:
import bz2
decompressed = []
with open('enwiki-20231020-pages-articles-multistream1.xml-p1p41242.bz2', 'rb') as f:
    decompressor = bz2.BZ2Decompressor()
    # Prefer leftover bytes from a finished stream; otherwise read the next chunk.
    while chunk := decompressor.unused_data or f.read(10000):
        # The previous stream has ended, so start a new decompressor.
        if decompressor.eof:
            decompressor = bz2.BZ2Decompressor()
        decompressed.append(decompressor.decompress(chunk))
print(sum(map(len, decompressed)))
This outputs the total size of the uncompressed data of the given sample bzip2 file:
1018211968
The full decompressed content is then:
b''.join(decompressed)
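If you only need the decompressed bytes and not the per-stream control, bz2.open() (which returns a BZ2File) handles multi-stream input for you, as the documentation excerpt above mentions. The following is a rough sketch under that assumption; decompressed_size is a hypothetical helper name, and the small temporary file stands in for the Wikipedia dump. Note that chunk_size here counts decompressed bytes, not compressed ones, and that nothing is accumulated in memory:

```python
import bz2
import os
import tempfile

def decompressed_size(path, chunk_size=10000):
    """Return the total decompressed size of a possibly multi-stream bzip2 file."""
    total = 0
    with bz2.open(path, "rb") as f:
        # read() returns b"" only once every stream has been consumed.
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total

# Demo on a small two-stream file (illustrative data, not the Wikipedia dump).
with tempfile.NamedTemporaryFile(suffix=".bz2", delete=False) as tmp:
    tmp.write(bz2.compress(b"a" * 2560) + bz2.compress(b"b" * 5000))
    demo_path = tmp.name
try:
    demo_size = decompressed_size(demo_path)
    print(demo_size)  # 7560
finally:
    os.remove(demo_path)
```

To apply it to the file from the question, you would pass its path to decompressed_size instead of the temporary file.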