I'm in the process of upgrading my code base to C++20 and would like to make use of std::u8string/char8_t. I'm using a 3rd-party library that takes and returns UTF-8 strings in its API; however, it hasn't been updated to C++20 yet and thus takes and returns the UTF-8 strings as regular std::strings instead of std::u8strings.
Converting std::u8string to std::string is pretty straightforward, as the u8string's buffer may be accessed through a char* pointer, so
std::u8string u8s = get_data();
std::string s(reinterpret_cast<char const*>(u8s.data()), u8s.size());
is valid code. However, as far as I'm aware, char8_t does not have the aliasing exemption that std::byte and char have, thus
std::string s = get_data();
std::u8string u8s(reinterpret_cast<char8_t const*>(s.data()), s.size());
is not valid.
I've resorted to
std::string s = get_data();
std::u8string u8s(s.size(), u8'\0');
std::memcpy(u8s.data(), s.data(), s.size());
for now, but that seems unnecessarily inefficient given that this first initializes the memory to all zeroes before writing the actual data into it.
Is there a way to avoid the initialization to all zeroes or another way to convert between std::string and std::u8string altogether?
std::u8string u8s(s.begin(), s.end()) should work just fine. You don't need the cast. The iterator-range constructor is templated, and char implicitly converts to char8_t.
The underlying type of char8_t being unsigned char is not a problem even if char is a signed type: conversion to an unsigned type is modular, so each byte keeps its bit pattern.