My main platform is Windows, which is why I use UTF-16 internally (mostly BMP strings). I would like to write these strings to the console.
Unfortunately there is no std::u16cout or std::u8cout, so I need to use std::wcout. Therefore I must convert my u16strings to wstrings. What is the best (and easiest) way to do that?
On Windows I know that wstring holds UTF-16 data, so I can create a simple std::u16string_view over the same data (no conversion). But on Linux wstring is usually UTF-32... Is there a way to do this without macros and without assumptions like sizeof(wchar_t) == 2 => UTF-16?
There is nothing in the C++20 standard that converts wchar_t to char32_t and back. After all, wchar_t is supposed to be large enough to contain any supported code point.
And indeed, everywhere Unicode above U+FFFF is supported, wchar_t is 32-bit, except on Windows (and in Java, but that's irrelevant). So yes, even today working with Unicode in a portable way is problematic, and sizeof(wchar_t)==2 or #ifdef _WIN32 both sound like legitimate workarounds.
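As a rough sketch of the sizeof(wchar_t) workaround, something like the following could do the conversion. The name to_wstring is made up for illustration, and the code assumes that a 2-byte wchar_t means UTF-16 and anything else means UTF-32. The standard does not guarantee that, which is the very assumption the question wanted to avoid. It needs C++17 for if constexpr and std::u16string_view.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper (not a standard function): converts UTF-16 text to a
// std::wstring. ASSUMPTION: wchar_t holds UTF-16 when it is 2 bytes wide
// and UTF-32 otherwise.
std::wstring to_wstring(std::u16string_view in) {
    std::wstring out;
    out.reserve(in.size());
    if constexpr (sizeof(wchar_t) == 2) {
        // Same code units on both sides; only the character type changes.
        for (char16_t c : in) out.push_back(static_cast<wchar_t>(c));
    } else {
        for (std::size_t i = 0; i < in.size(); ++i) {
            char32_t cp = in[i];
            const bool high = cp >= 0xD800 && cp <= 0xDBFF;
            if (high && i + 1 < in.size() &&
                in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                // Combine a high/low surrogate pair into one code point.
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;
            } else if (cp >= 0xD800 && cp <= 0xDFFF) {
                cp = 0xFFFD;  // unpaired surrogate: use the replacement character
            }
            out.push_back(static_cast<wchar_t>(cp));
        }
    }
    return out;
}
```

With that in place, std::wcout << to_wstring(my_u16string) would be the call site (my_u16string being whatever std::u16string you have). Whether the console then renders the characters correctly still depends on the locale or console mode, which this sketch does not touch.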
Having said that, wcout still seamlessly works with wchar_t on all platforms regardless of the underlying encoding.
You only need to take surrogate pairs into account if you cut wstrings or work with individual code points and want to support code points beyond the Basic Multilingual Plane (and only where wchar_t is UTF-16). That is still pretty easy: 0xD800–0xDBFF is the first (high) half of a pair, 0xDC00–0xDFFF is the second (low) half; don't cut in between.
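To illustrate the "don't cut in between" rule, here is a minimal sketch that truncates UTF-16 text without splitting a pair. The helper names (is_high_surrogate, is_low_surrogate, safe_prefix) are invented for this example, and it is written for char16_t; the same check would apply to wchar_t on platforms where it is 16-bit.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helpers for illustration; the ranges are the ones given above.
constexpr bool is_high_surrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t c)  { return c >= 0xDC00 && c <= 0xDFFF; }

// Returns a prefix of at most max_units code units. If the cut would fall
// between the two halves of a surrogate pair, the prefix is shortened by one.
constexpr std::u16string_view safe_prefix(std::u16string_view s,
                                          std::size_t max_units) {
    if (max_units >= s.size()) return s;
    if (max_units > 0 &&
        is_high_surrogate(s[max_units - 1]) &&
        is_low_surrogate(s[max_units])) {
        --max_units;  // step back so the whole pair is left out
    }
    return s.substr(0, max_units);
}
```

For a string made of the code units 'a', 0xD83D, 0xDE00, 'b', asking for 2 units gives back just 'a', while asking for 3 gives 'a' plus the complete pair.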