c++ utf-8 c++20 boost-locale

char8_t and utf8everywhere: How to convert to const char* APIs without invoking undefined behaviour?


The question "Is C++20 'char8_t' the same as our old 'char'?" is now a few years old.

I would like to know what the recommended way to handle conversion between char8_t and char is right now. boost::nowide (1.80.0) does not yet understand char8_t, and (AFAIK) neither does boost::locale.

As Tom Honermann noted:

reinterpret_cast<const char   *>(u8"text"); // Ok.
reinterpret_cast<const char8_t*>("text");   // Undefined behavior.

So: how do I interact with APIs that only accept const char* or const wchar_t* (think Win32 API) if my application's "default" string type is std::u8string? The recommendation seems to be https://utf8everywhere.org/.

If I convert between std::u8string and std::string like this:

std::u8string convert(std::string str)
{
    return std::u8string(reinterpret_cast<const char8_t*>(str.data()), str.size());
}
std::string convert(std::u8string str)
{
    return std::string(reinterpret_cast<const char*>(str.data()), str.size());
}

This would invoke the same UB that Tom Honermann mentioned. I would use this when talking to the Win32 API or any other API that takes a const char* or gives one back. I could route all conversions through boost::nowide, but in the end I get a const char* back from boost::nowide::narrow() that I still need to cast.

Is the current recommendation to just stay with char and ignore char8_t?


Solution

  • This would invoke the same UB that Tom Honermann mentioned.

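    As pointed out in the post you referred to, UB only arises when you cast from a char* to a char8_t* and read through the result, because char8_t, unlike char, unsigned char and std::byte, has no aliasing exemption. The other direction is fine.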

    If you are given a char* that points to UTF-8 encoded text (and you want to avoid the UB of simply casting), you can use std::ranges::transform to convert each char to a char8_t individually:

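    #include <algorithm>
    #include <string>

    std::u8string convert(const std::string& str)
    {
        // basic_string has no size-only constructor, so supply a fill character.
        std::u8string ret(str.size(), u8'\0');
        // Convert each code unit by value; no pointer cast is involved.
        std::ranges::transform(str, ret.begin(), [](char c) { return char8_t(c); });
        return ret;
    }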

    C++23's ranges::to will make using a named return variable unnecessary.

    For dealing with wchar_t interfaces (which you shouldn't have to, since nowadays UTF-8 support exists through narrow character interfaces on Windows), you'll have to do an actual UTF-8 to UTF-16 conversion, which you would have had to do anyway.