Tags: c++, c++17, utf-16, wstring, utf-32

Conversion from wstring to u16string and back (standard-conformant) in C++17 / C++20


My main platform is Windows, which is why I internally use UTF-16 (mostly BMP strings). I would like to write these strings to the console.

Unfortunately there is no std::u16cout or std::u8cout, so I need to use std::wcout. Therefore I must convert my u16strings to wstrings; what is the best (and easiest) way to do that?

On Windows I know that wstring holds UTF-16 data, so I can create a simple std::u16string_view over the same data (no conversion). But on Linux wstring is usually UTF-32... Is there a way to do this without macros and without assumptions like sizeof(wchar_t) == 2 => UTF-16?


Solution

  • There is nothing in the C++20 standard that converts wchar_t to char32_t and back. After all, wchar_t is supposed to be large enough to contain any supported code point.

    And indeed, on every platform that supports Unicode above U+FFFF, wchar_t is 32-bit, except on Windows (and in Java, but that's irrelevant here). So yes, even today, working with Unicode in a portable way is problematic, and checking sizeof(wchar_t) == 2 or using #ifdef _WIN32 are both legitimate workarounds.
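    A minimal sketch of that sizeof-based workaround, done at compile time with if constexpr so no macros are needed. The helper name u16_to_wstring is illustrative, and it assumes wchar_t is either UTF-16 (Windows) or UTF-32 (most other platforms):

    ```cpp
    #include <cstddef>
    #include <string>
    #include <string_view>

    // Hypothetical helper: convert UTF-16 text to a std::wstring.
    // Assumes wchar_t carries UTF-16 when it is 16-bit and UTF-32 otherwise.
    std::wstring u16_to_wstring(std::u16string_view u16)
    {
        if constexpr (sizeof(wchar_t) == sizeof(char16_t)) {
            // wchar_t is 16-bit (Windows): same encoding, copy the code units.
            return std::wstring(u16.begin(), u16.end());
        } else {
            // wchar_t is 32-bit: decode UTF-16, one wchar_t per code point.
            std::wstring out;
            out.reserve(u16.size());
            for (std::size_t i = 0; i < u16.size(); ++i) {
                char16_t c = u16[i];
                if (c >= 0xD800 && c <= 0xDBFF && i + 1 < u16.size()) {
                    char16_t low = u16[i + 1];
                    if (low >= 0xDC00 && low <= 0xDFFF) {
                        // Combine the surrogate pair into one code point.
                        out.push_back(static_cast<wchar_t>(
                            0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00)));
                        ++i; // skip the low surrogate
                        continue;
                    }
                }
                out.push_back(static_cast<wchar_t>(c)); // BMP (or lone surrogate)
            }
            return out;
        }
    }
    ```

    The reverse direction is symmetric: copy when wchar_t is 16-bit, otherwise split each code point above U+FFFF into a surrogate pair.
    
    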

    Having said that, wcout still works seamlessly with wchar_t on all platforms, regardless of the underlying encoding.

    It is only if you cut wstrings, or work with individual code points, and want to support code points beyond the Basic Multilingual Plane, that you need to take surrogate pairs into account (which is still fairly easy: 0xD800–0xDBFF = high (leading) surrogate, 0xDC00–0xDFFF = low (trailing) surrogate; don't cut between the two).
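    The surrogate ranges above can be applied like this, e.g. when counting code points in a UTF-16 string (a sketch; code_point_count is an illustrative name):

    ```cpp
    #include <cstddef>
    #include <string_view>

    // Count Unicode code points in UTF-16 text: a high surrogate (0xD800-0xDBFF)
    // followed by a low surrogate (0xDC00-0xDFFF) counts as one code point.
    std::size_t code_point_count(std::u16string_view s)
    {
        std::size_t n = 0;
        for (std::size_t i = 0; i < s.size(); ++i, ++n) {
            char16_t c = s[i];
            if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()
                && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
                ++i; // the pair is one code point, skip the low surrogate
        }
        return n;
    }
    ```

    The same range checks tell you where it is safe to cut: never split between a high surrogate and the low surrogate that follows it.
    
    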