c++ lz77

How to use std::string to store bytes (unsigned chars) in a right way?


I'm implementing the LZ77 compression algorithm, and I have trouble storing unsigned chars in a string. To compress a file, I open it in binary mode and read its contents as chars (because 1 char is equal to 1 byte, afaik) into a std::string. Everything works perfectly fine with char. But after some googling I learned that char is not always 1 byte, so I decided to swap it for unsigned char. And here things start to get tricky.

So, my question is – is there a way to properly save unsigned chars to a string?

I tried typedef std::basic_string<unsigned char> ustring and swapped all the related functions for their basic_ counterparts that work with unsigned char, but I still lose 3 bytes.

UPDATE: I found out that the 3 bytes (symbols) are lost not because of std::string, but because of std::istream_iterator, which I use instead of std::istreambuf_iterator to create the string of unsigned chars (because std::istreambuf_iterator's template argument is char, not unsigned char).

So, are there any solutions to this particular problem?

Example:

std::vector<char> tempbuf(std::istreambuf_iterator<char>(file), {}); // reads 112782 symbols

std::vector<char> tempbuf(std::istream_iterator<char>(file), {}); // reads 112779 symbols

Sample code:

void LZ77::readFileUnpacked(std::string& path)
{
    std::ifstream file(path, std::ios::in | std::ios::binary);

    if (file.is_open())
    {
        // Works just fine with char, but loses 3 bytes with unsigned
        std::string tempstring = std::string(std::istreambuf_iterator<char>(file), {});
        file.close();
    }
    else
        throw std::ios_base::failure("Failed to open the file");
}

Solution

  • char in all of its forms (and std::byte, which is isomorphic with unsigned char) is always the smallest possible type that a system supports. The C++ standard defines that sizeof(char) and its variations shall always be exactly 1.

    "One" what? That's implementation-defined. But every type in the system will be some multiple of sizeof(char) in size.

    So you shouldn't be too concerned about systems where char is not 8 bits. If you're working on a system where CHAR_BIT isn't 8, then that system can't handle 8-bit bytes directly at all, so unsigned char won't be any different or better for this purpose.


    As to the particulars of your problem, istream_iterator is fundamentally different from istreambuf_iterator. The purpose of istreambuf_iterator is to allow iterator access to the actual stream as a sequence of characters. The purpose of istream_iterator<T> is to allow access to a stream as if by performing a repeated sequence of operator>> calls with a T value.

    So if you're using istream_iterator<char>, then you're saying that you want to read the stream as if you did stream >> some_char; for each iterator access. That isn't equivalent to accessing the stream's characters directly. Specifically, FormattedInputFunctions like operator>> can do things like skip whitespace, depending on how you set up your stream.