bzip2 compression libzip

How to decompress pbzip2 data in a memory buffer using the libbz2 library in C++


I have working code that decompresses bzip2 data by calling the BZ2_bzDecompress API. It looks something like this:

while (bytes_input < len) {
  isDone = false;

  // Initialize the input buffer and its length
  size_t in_buffer_size = len - bytes_input;
  the_bz2_stream.avail_in = in_buffer_size;
  the_bz2_stream.next_in = (char*)data + bytes_input;

  // Remaining space in the output buffer
  size_t out_buffer_size = output_size - bytes_uncompressed;
  if (out_buffer_size == 0) {  // out of space in the output buffer
    break;
  }

  the_bz2_stream.avail_out = out_buffer_size;
  the_bz2_stream.next_out = (char*)output + bytes_uncompressed;  // output buffer

  ret = BZ2_bzDecompress(&the_bz2_stream);
  if (ret != BZ_OK && ret != BZ_STREAM_END) {
    throw Bzip2Exception("Bzip2 failed. ", ret);
  }

  bytes_input += in_buffer_size - the_bz2_stream.avail_in;
  bytes_uncompressed += out_buffer_size - the_bz2_stream.avail_out;

  *data_consumed = bytes_input;

  if (ret == BZ_STREAM_END) {
    ret = BZ2_bzDecompressEnd(&the_bz2_stream);
    if (ret != BZ_OK) {
      throw Bzip2Exception("Bzip2 fail. ", ret);
    }
    isDone = true;
  }
}

This works great for native bzip2 compressed files, but for pbzip2 (Parallel Bzip2) and "Splittable" bzip2 data, it throws a "BZ_PARAM_ERROR".

The pbzip2 documentation says this:

Data compressed with pbzip2 is broken into multiple streams and each stream is bzip2 compressed looking like this: [-----|-----|-----|-----|-----|-----|-----|-----|-----]

If you are writing software with libbzip2 to decompress data created with pbzip2, you must take into account that the data contains multiple bzip2 streams so you will encounter end-of-stream markers from libbzip2 after each stream and must look-ahead to see if there are any more streams to process before quitting. The bzip2 program itself will automatically handle this condition.

Source: http://compression.ca/pbzip2/

Can someone please tell me how to handle this? Should I be using some other libbzip2 API?

Also, pbzip2 files are compatible with the normal "bunzip2" command. How is it that bzip2 handles this gracefully while my code throws a BZ_PARAM_ERROR?

Thanks.


Solution

  • After your BZ2_bzDecompressEnd(), you need to call BZ2_bzDecompressInit() again if there is still data left to decompress, i.e. bytes_input < len. (You must already have called it once before that loop.)

    To decompress each of the |-----| streams, you need to do an init, some number of decompress calls, and an end. So if you still have input left, you need to do another init, n decompress calls, and an end.

    Make sure that you do a final end, in order to avoid a big memory leak.

    You're getting a BZ_PARAM_ERROR because you are trying to use an uninitialized bz_stream to decompress. Once you do BZ2_bzDecompressEnd(), you can't use that bz_stream any more, unless you do a BZ2_bzDecompressInit() on it.