Tags: c++, boost, bzip2, bz2

Boost 1.59 not decompressing all bzip2 streams


I've been trying to decompress some .bz2 files on the fly, line by line, because the files I'm dealing with are massive when uncompressed (in the region of 100 GB). I wanted a solution that saves disk space.

I have no problems decompressing files compressed with vanilla bzip2, but for files compressed with pbzip2 only the first bz2 stream is decompressed. This bug tracker ticket relates to the problem: https://svn.boost.org/trac/boost/ticket/3853 but I was led to believe it was fixed after version 1.41. I've checked the bzip2.hpp file and it contains the 'fixed' version, and I've also checked that the version of Boost used in the program is 1.59.

The code is here:

cout<<"Warning bzip2 support is a little buggy!"<<endl;

//Open the compressed file in binary mode
trans_file.open(files[i].c_str(), std::ios_base::in | std::ios_base::binary);

//Set up boost bzip2 decompression
boost::iostreams::filtering_istream in;
in.push(boost::iostreams::bzip2_decompressor());
in.push(trans_file);
std::string str;

//Begin reading
while(std::getline(in, str))
{
    std::stringstream stream(str);
    stream>>id_f>>id_i>>aif;
    /* Do stuff with values here*/
}

Any suggestions would be great. Thanks!


Solution

  • You are right.

    It seems that changeset #63057 only fixes part of the issue.

    The corresponding unit test does pass, though. However, it uses the copy algorithm (and also runs on a composite<> instead of a filtering_istream, if that is relevant).

    I'd report this as a defect or a regression, and include a file that exhibits the problem, of course. For me it reproduces with just /etc/dictionaries-common/words compressed with pbzip2 (default options).

    I have the test.bz2 here: http://7f0d2fd2-af79-415c-ab60-033d3b494dc9.s3.amazonaws.com/test.bz2

    Here's my test program:

    #include <boost/iostreams/filtering_stream.hpp>
    #include <boost/iostreams/filter/bzip2.hpp>
    #include <boost/iostreams/stream.hpp>
    #include <fstream>
    #include <iostream>
    #include <string>
    
    namespace io = boost::iostreams;
    
    void multiple_member_test(); // from the unit tests in changeset #63057
    
    int main() {
        //multiple_member_test();
        //return 0;
    
        std::ifstream trans_file("test.bz2", std::ios::binary);
    
        //Set up boost bzip2 decompression
        io::filtering_istream in;
        in.push(io::bzip2_decompressor());
        in.push(trans_file);
    
        //Begin reading
        std::string str;
        while(std::getline(in, str))
        {
            std::cout << str << "\n";
        }
    }
    
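    #include <boost/iostreams/compose.hpp>
    #include <boost/iostreams/copy.hpp>
    #include <boost/iostreams/device/array.hpp>
    #include <boost/iostreams/device/back_inserter.hpp>
    #include <boost/range/iterator_range.hpp>  // boost::make_iterator_range
    #include <algorithm>                       // std::equal
    #include <cassert>
    #include <sstream>
    #include <vector>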
    
    void multiple_member_test()  // from the unit tests in changeset #63057
    { 
        std::string      data(20ul << 20, '*');
        std::vector<char>  temp, dest; 
    
        // Write compressed data to temp, twice in succession 
        io::filtering_ostream out; 
        out.push(io::bzip2_compressor()); 
        out.push(io::back_inserter(temp)); 
        io::copy(boost::make_iterator_range(data), out); 
        out.push(io::back_inserter(temp)); 
        io::copy(boost::make_iterator_range(data), out); 
    
        // Read compressed data from temp into dest 
        io::filtering_istream in; 
        in.push(io::bzip2_decompressor()); 
        in.push(io::array_source(&temp[0], temp.size())); 
        io::copy(in, io::back_inserter(dest)); 
    
        // Check that dest consists of two copies of data 
        assert(data.size() * 2 == dest.size()); 
        assert(std::equal(data.begin(), data.end(), dest.begin())); 
        assert(std::equal(data.begin(), data.end(), dest.begin() + dest.size() / 2)); 
    
        dest.clear(); 
        io::copy( 
                io::array_source(&temp[0], temp.size()), 
                io::compose(io::bzip2_decompressor(), io::back_inserter(dest))); 
    
        // Check that dest consists of two copies of data 
        assert(data.size() * 2 == dest.size()); 
        assert(std::equal(data.begin(), data.end(), dest.begin())); 
        assert(std::equal(data.begin(), data.end(), dest.begin() + dest.size() / 2)); 
    }
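
Before filing the report, it may help to confirm that a given file really does contain several concatenated bzip2 streams, as pbzip2 output does. The following is a minimal sketch using only the standard library, not part of Boost or libbzip2. The function name `find_bzip2_stream_offsets` is hypothetical. It relies on the bzip2 format starting each stream with "BZh", a level digit '1' to '9', and then a 6-byte magic number, and on streams starting at byte boundaries. It is a heuristic: the same 10 bytes could in principle occur inside compressed data.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: returns the byte offsets at which a bzip2 stream
// appears to begin. Heuristic only; a false match inside compressed
// payload is unlikely but not impossible.
std::vector<std::size_t> find_bzip2_stream_offsets(const std::string& bytes)
{
    // After "BZh" and the level digit comes the 48-bit block magic, or the
    // end-of-stream magic when the stream contains no data at all.
    static const char block_magic[] = "\x31\x41\x59\x26\x53\x59";
    static const char eos_magic[]   = "\x17\x72\x45\x38\x50\x90";
    const std::size_t header_len = 4 + 6;

    std::vector<std::size_t> offsets;
    for (std::size_t i = 0; i + header_len <= bytes.size(); ++i) {
        if (bytes.compare(i, 3, "BZh") != 0) continue;

        const char level = bytes[i + 3];
        if (level < '1' || level > '9') continue;

        const char* magic = bytes.data() + i + 4;
        if (std::memcmp(magic, block_magic, 6) == 0 ||
            std::memcmp(magic, eos_magic, 6) == 0) {
            offsets.push_back(i);
        }
    }
    return offsets;
}
```

To use it, read a chunk of the file in binary mode into a `std::string` and pass it in. If more than one offset comes back, the file is multi-stream and should trigger the behaviour described above. For very large inputs, scanning only the first few megabytes is probably enough to see a second header, assuming the file was written by pbzip2 with its default block size.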