I have experimented with std::codecvt on MSVC and encountered an issue with multibyte character encodings ‒ it cannot convert back from valid multibyte sequences, even when those can be produced when encoding wide characters:
std::locale loc(".950");
using cvttype = std::codecvt<wchar_t, char, std::mbstate_t>;
const auto &cvt = std::use_facet<cvttype>(loc);
std::wstring_convert<cvttype> conv(&cvt);
auto bytes = conv.to_bytes(L"\u4F53"); // 体
auto str = conv.from_bytes(bytes); // range_error("bad conversion")
Yes, I know that taking a facet from the locale causes ownership issues here, but the problem also occurs when going through a proxy facet, or when calling std::codecvt::in manually.
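For reference, a manual call to std::codecvt::in could look roughly like the sketch below. The helper name decode_with_codecvt is made up for illustration, and the code assumes that the conversion never yields more than one wchar_t per input byte, which should hold for the code pages discussed here:

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of any library): decodes `bytes` with the
// codecvt facet of `loc`, bypassing std::wstring_convert entirely.
std::wstring decode_with_codecvt(const std::locale &loc, const std::string &bytes)
{
    using cvttype = std::codecvt<wchar_t, char, std::mbstate_t>;
    if (bytes.empty())
        return std::wstring();

    // The facet stays owned by the locale, so there is no ownership issue.
    const cvttype &cvt = std::use_facet<cvttype>(loc);

    std::mbstate_t state{};
    // Assumption: at most one wide character is produced per input byte.
    std::wstring out(bytes.size(), L'\0');

    const char *from = bytes.data();
    const char *from_end = from + bytes.size();
    const char *from_next = from;
    wchar_t *to = &out[0];
    wchar_t *to_end = to + out.size();
    wchar_t *to_next = to;

    std::codecvt_base::result res =
        cvt.in(state, from, from_end, from_next, to, to_end, to_next);
    if (res != std::codecvt_base::ok)
        throw std::range_error("bad conversion");

    // Trim the buffer to the number of wide characters actually written.
    out.resize(static_cast<std::size_t>(to_next - to));
    return out;
}
```

Called as decode_with_codecvt(loc, bytes) with the loc and bytes from the snippet above, this is the path that reported an error in my case.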
I have tested this with a couple of other code pages such as 949 or 52936, all with the same result. I have used std::codecvt_utf8 before with no such issues; it is only locale-based std::codecvt that fails here.
Why does it have this behaviour? Can it be fixed?
Compiling with VS 2017.
It turned out I had msvcp140d.dll in the program folder which was 4 years older than the system's one. Whether it was simply bugged or incompatible with the system, removing it solved the issue.
In case someone stumbles on the same issue and this is not the cause, you can first check whether the C conversion functions work:
std::setlocale(LC_ALL, ".950");
std::mbstate_t state{};
wchar_t res;
std::size_t len = std::mbrtowc(&res, &bytes[0], bytes.size(), &state);
If res got a valid character, you can skip the std::codecvt machinery and just use std::mbrtowc in a loop.
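Such a loop could look like the following minimal sketch. The function name decode_multibyte is hypothetical, and it assumes the C locale has already been set (for example with std::setlocale as above):

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes `bytes` according to the current C locale
// (LC_CTYPE), one multibyte character at a time.
std::wstring decode_multibyte(const std::string &bytes)
{
    std::wstring out;
    std::mbstate_t state{};
    const char *p = bytes.data();
    std::size_t remaining = bytes.size();

    while (remaining > 0)
    {
        wchar_t wc = L'\0';
        std::size_t len = std::mbrtowc(&wc, p, remaining, &state);

        if (len == static_cast<std::size_t>(-1))
            throw std::runtime_error("invalid multibyte sequence");
        if (len == static_cast<std::size_t>(-2))
            throw std::runtime_error("truncated multibyte sequence");
        if (len == 0)
            len = 1; // a null character was decoded; it occupies one byte

        out.push_back(wc);
        p += len;
        remaining -= len;
    }
    return out;
}
```

The two special return values, (size_t)-1 and (size_t)-2, are treated as errors here; if you feed the input in chunks, you would instead keep the state after a -2 result and continue with the next chunk.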
If you can't call setlocale or want to rely on the std::locale instance directly, you still can, albeit with some hacks:
// Use your compiler version and check every time you update!
#if _MSC_VER == 1916
using cvtvec_ptr_t = std::_Locinfo::_Cvtvec(std::codecvt<wchar_t, char, std::mbstate_t>::*);
// Output variable to receive the _Cvt member
static cvtvec_ptr_t cvtvec_ptr;
namespace
{
    template <cvtvec_ptr_t Cvtvec>
    struct get_private
    {
        get_private() noexcept
        {
            cvtvec_ptr = Cvtvec;
        }
        static get_private instance;
    };
    template <cvtvec_ptr_t Cvtvec>
    get_private<Cvtvec> get_private<Cvtvec>::instance;
    // Define an object of the type, passing the private member
    template struct get_private<&std::codecvt<wchar_t, char, std::mbstate_t>::_Cvt>;
}
#endif
std::mbstate_t state{};
wchar_t res;
std::size_t len = _Mbrtowc(&res, &bytes[0], bytes.size(), &state, &(cvt.*cvtvec_ptr));
The trick here is to use explicit template instantiation to introduce a global variable whose initializer sets cvtvec_ptr to point to the private std::codecvt<wchar_t, char, std::mbstate_t>::_Cvt member. This is technically valid, but it uses reserved identifiers that are meant to be used by the std::codecvt::do_in implementation, not by user code. While it is likely that this code will either work or fail to compile, there are obviously no guarantees.