c++ string char8-t

Create std::u8string from std::string/char const* when the latter is already in utf-8


I'm in the process of upgrading my code base to C++20 and would like to make use of std::u8string/char8_t. I'm using a 3rd-party library that takes and returns UTF-8 strings in its API, however it hasn't been updated to C++20 yet and thus takes and returns the UTF-8 strings as regular std::strings instead of std::u8strings.

Converting std::u8string to std::string is pretty straightforward, as the u8string's buffer may be accessed through a char* pointer, so

std::u8string u8s = get_data();
std::string s(reinterpret_cast<char const*>(u8s.data()), u8s.size());

is valid code. However, as far as I'm aware, char8_t does not have the aliasing exemption that char, unsigned char and std::byte have, thus

std::string s = get_data();
std::u8string u8s(reinterpret_cast<char8_t const*>(s.data()), s.size());

is not valid.

I've resorted to

std::string s = get_data();
std::u8string u8s(s.size(), u8'\0');
std::memcpy(u8s.data(), s.data(), s.size());

for now, but that seems unnecessarily inefficient given that this first initializes the memory to all zeroes before writing the actual data into it.

Is there a way to avoid the initialization to all zeroes or another way to convert between std::string and std::u8string altogether?


Solution

  • std::u8string u8s(s.begin(), s.end()) should work just fine. You don't need the cast. The iterator-range constructor is templated on the iterator type, and char implicitly converts to char8_t, so each element is converted as it is copied.

    The underlying type of char8_t being unsigned char is not a problem even if char is a signed type: conversion to an unsigned type is done modulo 2^8, so every byte keeps its bit pattern.