I'm coding an LZ77 compression algorithm, and I have trouble storing unsigned chars in a string. To compress any file, I use its binary representation and then read it as chars (because 1 char is equal to 1 byte, afaik) into a std::string. Everything works perfectly fine with char. But after some time googling I learned that char is not always 1 byte, so I decided to swap it for unsigned char. And here things start to get tricky:
So, my question is – is there a way to properly save unsigned chars to a string?
I tried to use typedef basic_string<unsigned char> ustring and swap all related functions for their basic alternatives to use with unsigned char, but I still lose 3 bytes.
UPDATE: I found out that the 3 bytes (symbols) are lost not because of std::string, but because of std::istream_iterator (which I use instead of std::istreambuf_iterator) to create a string of unsigned chars (because std::istreambuf_iterator's argument is char, not unsigned char).
So, are there any solutions to this particular problem?
Example:
std::vector<char> tempbuf(std::istreambuf_iterator<char>(file), {}); // reads 112782 symbols
std::vector<char> tempbuf(std::istream_iterator<char>(file), {}); // reads 112779 symbols
Sample code:
void LZ77::readFileUnpacked(std::string& path)
{
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (file.is_open())
    {
        // Works just fine with char, but loses 3 bytes with unsigned
        std::string tempstring = std::string(std::istreambuf_iterator<char>(file), {});
        file.close();
    }
    else
        throw std::ios_base::failure("Failed to open the file");
}
char in all of its forms (and std::byte, which is isomorphic with unsigned char) is always the smallest possible type that a system supports. The C++ standard defines that sizeof(char) and its variations shall always be exactly 1.
"One" what? One byte, as the standard uses the term, but how many bits that byte holds is implementation-defined. Either way, every type in the system will be some multiple of sizeof(char) in size.
So you shouldn't be too concerned over systems where char is not one 8-bit byte. If you're working under a system where CHAR_BIT isn't 8, then that system can't handle 8-bit bytes directly at all. So unsigned char won't be any different/better for this purpose.
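A minimal sketch of how these guarantees can be checked at compile time. It assumes a C++17 compiler (for std::byte); the helper names bits_per_byte and has_octet_bytes are made up for illustration and do not come from the question's code.

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::byte

// Guaranteed by the standard on every conforming implementation.
static_assert(sizeof(char) == 1, "sizeof(char) is 1 by definition");
static_assert(sizeof(signed char) == 1, "same for signed char");
static_assert(sizeof(unsigned char) == 1, "same for unsigned char");
static_assert(sizeof(std::byte) == 1, "std::byte has the size of unsigned char");

// Hypothetical helper: number of bits in one char on this platform.
constexpr int bits_per_byte() { return CHAR_BIT; }

// Hypothetical helper: true on platforms where a char is an 8-bit octet.
constexpr bool has_octet_bytes() { return CHAR_BIT == 8; }
```

If the compressor really depends on 8-bit bytes, a line such as static_assert(has_octet_bytes(), "8-bit bytes required"); states that assumption explicitly, and it holds equally for char and unsigned char.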
As to the particulars of your problem, istream_iterator is fundamentally different from istreambuf_iterator. The purpose of the latter is to allow iterator access to the actual stream as a sequence of values. The purpose of istream_iterator<T> is to allow access to a stream as if by performing a repeated sequence of operator>> calls with a T value.
So if you're doing istream_iterator<char>
, then you're saying that you want to read the stream as if you did stream >> some_char;
variable for each iterator access. That isn't actually isomorphic with accessing the stream's characters directly. Specifically, FormattedInputFunctions like operator>>
can do things like skip whitespace, depending on how you set up your stream.