I have a working version of decompressing bzip2 data where I call the bz2_bzdecompress API. It goes something like this
while (bytes_input < len) {
isDone = false;
// Initialize the input buffer and its length
size_t in_buffer_size = len -bytes_input;
the_bz2_stream.avail_in = in_buffer_size;
the_bz2_stream.next_in = (char*)data +bytes_input;
size_t out_buffer_size =
output_size -bytes_uncompressed; // size of output buffer
if (out_buffer_size == 0) { // out of space in the output buffer
break;
}
the_bz2_stream.avail_out = out_buffer_size;
the_bz2_stream.next_out =
(char*)output +bytes_uncompressed; // output buffer
ret = BZ2_bzDecompress(&the_bz2_stream);
if (ret != BZ_OK && ret != BZ_STREAM_END) {
throw Bzip2Exception("Bzip2 failed. ", ret);
}
bytes_input += in_buffer_size - the_bz2_stream.avail_in;
bytes_uncompressed += out_buffer_size - the_bz2_stream.avail_out;
*data_consumed =bytes_input;
if (ret == BZ_STREAM_END) {
ret = BZ2_bzDecompressEnd(&the_bz2_stream);
if (ret != BZ_OK) {
throw Bzip2Exception("Bzip2 fail. ", ret);
}
isDone = true;
}
}
This works great for native bzip2 compressed files, but for pbzip2 (Parallel Bzip2) and "Splittable" bzip2 data, it throws a "BZ_PARAM_ERROR".
I see that pbzip2 in their documentation says this-
Data compressed with pbzip2 is broken into multiple streams and each stream is bzip2 compressed looking like this: [-----|-----|-----|-----|-----|-----|-----|-----|-----]
If you are writing software with libbzip2 to decompress data created with pbzip2, you must take into account that the data contains multiple bzip2 streams so you will encounter end-of-stream markers from libbzip2 after each stream and must look-ahead to see if there are any more streams to process before quitting. The bzip2 program itself will automatically handle this condition.
Source:http://compression.ca/pbzip2/
Can someone please tell me how to handle this? Should I be using some other libzip2 API?
Also, pbzip2 files are compatible with the normal "bunzip2" command. How is that bzip2 handles this gracefully while my code throws a BZ_PARAM_ERROR?
Thanks.
After your BZ2_bzDecompressEnd()
you need to call BZ2_bzDecompressInit()
again (you must have called it initially before that loop), if there is still data left to decompress, i.e. bytes_input < len
.
To decompress each of the |-----|
blocks, you need to do an init
, some number of decompress
calls, and an end
. So if you still have input left, then you need to do another init
, n*decompress
, end
.
Make sure that you do a final end
, in order to avoid a big memory leak.
You're getting a BZ_PARAM_ERROR
because you are trying to use an uninitialized bz_stream
to decompress. Once you do BZ2_bzDecompressEnd()
, you can't use that bz_stream
any more, unless you do a BZ2_bzDecompressInit()
on it.