Tags: c++, c++11, unicode, codecvt

Difference between "codecvt_utf8_utf16" and "codecvt_utf8" for converting from UTF-8 to UTF-16


I came across two code snippets:

std::wstring str = std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().from_bytes("some utf8 string");

and,

std::wstring str = std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes("some utf8 string");

Are they both correct ways to convert UTF-8 stored in a std::string to UTF-16 in a std::wstring?


Solution

  • codecvt_utf8_utf16 does exactly what it says: converts between UTF-8 and UTF-16, both of which are well-understood and portable encodings.

    codecvt_utf8 converts between UTF-8 and UCS-2/4 (depending on the size of the given type). UCS-2 and UTF-16 are not the same thing.

    So if your goal is to store genuine UTF-16 in a wchar_t, use codecvt_utf8_utf16. However, if you're trying to write cross-platform code that treats wchar_t as a generic Unicode type, neither facet alone will do. The UTF-16 facet always converts to UTF-16, whereas wchar_t on non-Windows platforms is generally expected to hold UTF-32/UCS-4. Conversely, codecvt_utf8 only converts to UCS-2/4, but on Windows, wchar_t strings are "supposed" to be full UTF-16.
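
    The practical difference shows up for code points outside the Basic Multilingual Plane. Here is a minimal sketch (it uses char16_t rather than wchar_t so the element size is the same on every platform; the error handling relies on std::wstring_convert throwing std::range_error on a failed conversion):

    #include <codecvt>
    #include <iostream>
    #include <locale>
    #include <stdexcept>
    #include <string>

    int main()
    {
        // UTF-8 encoding of U+1F600, a code point outside the BMP.
        const std::string utf8 = "\xF0\x9F\x98\x80";

        // codecvt_utf8_utf16 produces real UTF-16: a surrogate pair, i.e. 2 code units.
        std::u16string utf16 =
            std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t>().from_bytes(utf8);
        std::cout << "codecvt_utf8_utf16: " << utf16.size() << " code units\n";

        // codecvt_utf8 with a 16-bit element type produces UCS-2, which cannot
        // represent code points above U+FFFF, so this conversion fails.
        try {
            std::u16string ucs2 =
                std::wstring_convert<std::codecvt_utf8<char16_t>, char16_t>().from_bytes(utf8);
            std::cout << "codecvt_utf8: " << ucs2.size() << " code units\n";
        } catch (const std::range_error&) {
            std::cout << "codecvt_utf8: conversion failed (not representable in UCS-2)\n";
        }
    }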

    So you can't write code to satisfy all platforms without some #ifdef or template work. On Windows, you should use codecvt_utf8_utf16; on non-Windows, you should use codecvt_utf8.
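
    For example, one way to hide the platform split behind a single helper (a sketch only; the helper name widen_utf8 and the use of _WIN32 as the platform test are assumptions for illustration, and note that these facets are deprecated since C++17):

    #include <codecvt>
    #include <locale>
    #include <string>

    // Convert UTF-8 to the platform's wide encoding: UTF-16 where wchar_t is
    // 16 bits (Windows), UCS-4/UTF-32 elsewhere.
    std::wstring widen_utf8(const std::string& utf8)
    {
    #if defined(_WIN32)
        return std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>>().from_bytes(utf8);
    #else
        return std::wstring_convert<std::codecvt_utf8<wchar_t>>().from_bytes(utf8);
    #endif
    }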

    Or better yet, just use UTF-8 internally and find APIs that directly take strings in a specific format rather than platform-dependent wchar_t stuff.