c++ visual-c++ character-encoding cjk codecvt

Decoding multibyte non-Unicode characters through codecvt fails


I have been experimenting with std::codecvt on MSVC and ran into a problem with multibyte character encodings. It fails to decode valid multibyte sequences, even ones it has just produced by encoding wide characters:

std::locale loc(".950");
using cvttype = std::codecvt<wchar_t, char, std::mbstate_t>;
const auto &cvt = std::use_facet<cvttype>(loc);
std::wstring_convert<cvttype> conv(&cvt);
auto bytes = conv.to_bytes(L"\u4F53"); // 体
auto str = conv.from_bytes(bytes); // range_error("bad conversion")

Yes, I know that taking a facet from the locale causes ownership issues here, but the problem also appears when going through a proxy facet or when calling std::codecvt::in manually.

I have tested this with a couple of other code pages, such as 949 and 52936, all with the same result. I have used std::codecvt_utf8 before without any such issue; only the locale-based std::codecvt fails here.
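
For reference, the manual std::codecvt::in call mentioned above can be sketched roughly as follows. This is only an illustration: the helper name decode_with_facet, the buffer sizing (at most one wide character per input byte) and the error handling are assumptions, not the exact code that was tested.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

using cvttype = std::codecvt<wchar_t, char, std::mbstate_t>;

// Hypothetical helper: decodes `bytes` with the codecvt facet of `loc`,
// bypassing std::wstring_convert entirely.
std::wstring decode_with_facet(const std::locale &loc, const std::string &bytes)
{
    if (bytes.empty())
        return std::wstring();

    const auto &cvt = std::use_facet<cvttype>(loc);
    std::mbstate_t state{};

    // Assumption: a multibyte sequence never yields more wide characters
    // than it has bytes, so this buffer is large enough.
    std::wstring out(bytes.size(), L'\0');

    const char *from_next = nullptr;
    wchar_t *to_next = nullptr;
    const auto res = cvt.in(state,
                            bytes.data(), bytes.data() + bytes.size(), from_next,
                            &out[0], &out[0] + out.size(), to_next);

    // Anything other than `ok` (error, partial) is treated as a failure here.
    if (res != std::codecvt_base::ok)
        throw std::range_error("bad conversion");

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}
```

In the failing setup, a call like decode_with_facet(std::locale(".950"), bytes) would be expected to hit the range_error branch in the same way from_bytes does.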

Why does it have this behaviour? Can it be fixed?

Compiling with VS 2017.


Solution

  • It turned out I had a copy of msvcp140d.dll in the program folder that was 4 years older than the system's one. Whether it was simply buggy or incompatible with the rest of the system, removing it solved the issue.

    In case someone stumbles on the same issue and this is not the cause, first check whether the C conversion functions work:

    std::setlocale(LC_ALL, ".950");
    std::mbstate_t state{};
    wchar_t res;
    std::size_t len = std::mbrtowc(&res, &bytes[0], bytes.size(), &state);
    

    If res receives a valid character (that is, len is neither (std::size_t)-1 nor (std::size_t)-2), you can skip the std::codecvt machinery and just call std::mbrtowc in a loop.
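
    A minimal sketch of such a loop, assuming the C locale has already been selected with std::setlocale. The helper name decode_multibyte is hypothetical, and treating a decoded null character as occupying exactly one byte is an assumption that should hold for code pages like 950 but not necessarily for every encoding.

    ```cpp
    #include <clocale>
    #include <cwchar>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper: decodes `bytes` using the multibyte encoding of
    // the current C locale (set beforehand with std::setlocale).
    std::wstring decode_multibyte(const std::string &bytes)
    {
        std::wstring result;
        std::mbstate_t state{};
        const char *p = bytes.data();
        std::size_t remaining = bytes.size();

        while (remaining > 0)
        {
            wchar_t wc;
            std::size_t len = std::mbrtowc(&wc, p, remaining, &state);
            if (len == static_cast<std::size_t>(-1))
                throw std::range_error("invalid multibyte sequence");
            if (len == static_cast<std::size_t>(-2))
                throw std::range_error("truncated multibyte sequence");
            if (len == 0)
                len = 1; // a null character was decoded; assume it took one byte

            result.push_back(wc);
            p += len;
            remaining -= len;
        }
        return result;
    }
    ```

    With the locale set as in the check above, decode_multibyte(bytes) would return the decoded wide string, or throw if std::mbrtowc rejects the input.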

    If you can't call setlocale, or want to rely on the std::locale instance directly, you can still do the conversion, albeit with some hacks:

    // Use your compiler version and check every time you update!
    #if _MSC_VER == 1916
    using cvtvec_ptr_t = std::_Locinfo::_Cvtvec(std::codecvt<wchar_t, char, std::mbstate_t>::*);
    
    // Output variable to receive the _Cvt member
    static cvtvec_ptr_t cvtvec_ptr;
    
    namespace
    {
        template <cvtvec_ptr_t Cvtvec>
        struct get_private
        {
            get_private() noexcept
            {
                cvtvec_ptr = Cvtvec;
            }
            static get_private instance;
        };
    
        template <cvtvec_ptr_t Cvtvec>
        get_private<Cvtvec> get_private<Cvtvec>::instance;
    
        // Explicitly instantiate the template, passing the private member
        // (access checks do not apply to names used in an explicit instantiation)
        template struct get_private<&std::codecvt<wchar_t, char, std::mbstate_t>::_Cvt>;
    }
    #endif
    
    std::mbstate_t state{};
    wchar_t res;
    std::size_t len = _Mbrtowc(&res, &bytes[0], bytes.size(), &state, &(cvt.*cvtvec_ptr));
    

    The trick here is to use explicit template instantiation to introduce a static object whose constructor sets cvtvec_ptr to point to the private std::codecvt<wchar_t, char, std::mbstate_t>::_Cvt member. This is technically valid, but it uses reserved identifiers that are meant to be used by the std::codecvt::do_in implementation, not by user code. While it is likely that this code will either work or fail to compile, there are obviously no guarantees.
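
    The same explicit-instantiation trick can be tried in isolation, without any MSVC internals. The sketch below applies it to a made-up class; Widget, secret_ and read_secret are hypothetical names used only for illustration.

    ```cpp
    // Hypothetical class with a private data member we want to reach.
    class Widget
    {
        int secret_ = 42;
    };

    using secret_ptr_t = int Widget::*;

    // Output variable to receive the pointer to the private member
    static secret_ptr_t secret_ptr;

    namespace
    {
        template <secret_ptr_t Ptr>
        struct get_private
        {
            // Runs during static initialization and publishes the pointer
            get_private() noexcept
            {
                secret_ptr = Ptr;
            }
            static get_private instance;
        };

        template <secret_ptr_t Ptr>
        get_private<Ptr> get_private<Ptr>::instance;

        // Explicit instantiation: naming the private member is allowed here
        template struct get_private<&Widget::secret_>;
    }

    // Reads the private member through the captured pointer-to-member
    int read_secret(const Widget &w)
    {
        return w.*secret_ptr;
    }
    ```

    Calling read_secret on a default-constructed Widget reads the value of secret_, which is the same mechanism the code above uses to reach _Cvt.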