c++c++11language-lawyerstrict-aliasinguint8t

reinterpret_cast between char* and std::uint8_t* - safe?


Now we all sometimes have to work with binary data. In C++ we work with sequences of bytes, and since the beginning char was the our building block. Defined to have sizeof of 1, it is the byte. And all library I/O functions use char by default. All is good but there was always a little concern, a little oddity that bugged some people - the number of bits in a byte is implementation-defined.

So in C99, it was decided to introduce several typedefs to let the developers easily express themselves, the fixed-width integer types. Optional, of course, since we never want to hurt portability. Among them, uint8_t, migrated into C++11 as std::uint8_t, a fixed width 8-bit unsigned integer type, was the perfect choice for people who really wanted to work with 8 bit bytes.

And so, developers embraced the new tools and started building libraries that expressively state that they accept 8-bit byte sequences, as std::uint8_t*, std::vector<std::uint8_t> or otherwise.

But, perhaps with a very deep thought, the standardization committee decided not to require implementation of std::char_traits<std::uint8_t> therefore prohibiting developers from easily and portably instantiating, say, std::basic_fstream<std::uint8_t> and easily reading std::uint8_ts as a binary data. Or maybe, some of us don't care about the number of bits in a byte and are happy with it.

But unfortunately, two worlds collide and sometimes you have to take a data as char* and pass it to a library that expects std::uint8_t*. But wait, you say, isn't char variable bit and std::uint8_t is fixed to 8? Will it result into a loss of data?

Well, there is an interesting Standardese on this. The char defined to hold exactly one byte and byte is the lowest addressable chunk of memory, so there can't be a type with bit width lesser than that of char. Next, it is defined to be able to hold UTF-8 code units. This gives us the minimum - 8 bits. So now we have a typedef which is required to be 8 bits wide and a type that is at least 8 bits wide. But are there alternatives? Yes, unsigned char. Remember that signedness of char is implementation-defined. Any other type? Thankfully, no. All other integral types have required ranges which fall outside of 8 bits.

Finally, std::uint8_t is optional, that means that the library which uses this type will not compile if it's not defined. But what if it compiles? I can say with a great degree of confidence that this means that we are on a platform with 8 bit bytes and CHAR_BIT == 8.

Once we have this knowledge, that we have 8-bit bytes, that std::uint8_t is implemented as either char or unsigned char, can we assume that we can do reinterpret_cast from char* to std::uint8_t* and vice versa? Is it portable?

This is where my Standardese reading skills fail me. I read about safely derived pointers ([basic.stc.dynamic.safety]) and, as far as I understand, the following:

std::uint8_t* buffer = /* ... */ ;
char* buffer2 = reinterpret_cast<char*>(buffer);
std::uint8_t buffer3 = reinterpret_cast<std::uint8_t*>(buffer2);

is safe if we don't touch buffer2. Correct me if I'm wrong.

So, given the following preconditions:

Is it portable and safe to cast char* and std::uint8_t* back and forth, assuming that we're working with binary data and the potential lack of sign of char doesn't matter?

I would appreciate references to the Standard with explanations.

EDIT: Thanks, Jerry Coffin. I'm going to add the quote from the Standard ([basic.lval], §3.10/10):

If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:

...

— a char or unsigned char type.

EDIT2: Ok, going deeper. std::uint8_t is not guaranteed to be a typedef of unsigned char. It can be implemented as extended unsigned integer type and extended unsigned integer types are not included in §3.10/10. What now?


Solution

  • Ok, let's get truly pedantic. After reading this, this and this, I'm pretty confident that I understand the intention behind both Standards.

    So, doing reinterpret_cast from std::uint8_t* to char* and then dereferencing the resulting pointer is safe and portable and is explicitly permitted by [basic.lval].

    However, doing reinterpret_cast from char* to std::uint8_t* and then dereferencing the resulting pointer is a violation of strict aliasing rule and is undefined behavior if std::uint8_t is implemented as extended unsigned integer type.

    However, there are two possible workarounds, first:

    static_assert(std::is_same_v<std::uint8_t, char> ||
        std::is_same_v<std::uint8_t, unsigned char>,
        "This library requires std::uint8_t to be implemented as char or unsigned char.");
    

    With this assert in place, your code will not compile on platforms on which it would result in undefined behavior otherwise.

    Second:

    std::memcpy(uint8buffer, charbuffer, size);
    

    Cppreference says that std::memcpy accesses objects as arrays of unsigned char so it is safe and portable.

    To reiterate, in order to be able to reinterpret_cast between char* and std::uint8_t* and work with resulting pointers portably and safely in a 100% standard-conforming way, the following conditions must be true:

    On a practical note, the above conditions are true on 99% of platforms and there is likely no platform on which the first 2 conditions are true while the 3rd one is false.