c++ windows utf-8 codepages boost-locale

Get the user's codepage name for functions in boost::locale::conv


The task at hand

I'm parsing a filename from a UTF-8 encoded XML file on Windows. I need to pass that filename to a function that I can't change. Internally it uses _fsopen(), which does not support Unicode strings.

Current approach

My current approach is to convert the filename to the user's charset, hoping that the filename is representable in that encoding. I use boost::locale::conv::from_utf() to convert from UTF-8 and boost::locale::util::get_system_locale() to get the name of the current locale.

Life is good?

I'm on a German system using code page Windows-1252, so get_system_locale() correctly yields de_DE.windows-1252. If I test the approach with a filename containing an umlaut, everything works as expected.

The Problem

Just to make sure, I switched my system locale to Ukrainian, which uses code page Windows-1251. With a Cyrillic letter in the filename, my approach fails. The reason is that get_system_locale() still yields de_DE.windows-1252, which is now incorrect.

On the other hand, GetACP() correctly yields 1252 for the German locale and 1251 for the Ukrainian locale. I also know that Boost.Locale can convert to a given code page, as this small test program works as I expect:

#include <boost/locale.hpp>
#include <iostream>
#include <string>
#include <windows.h>

int main()
{
    std::cout << "Codepage: " << GetACP() << std::endl;
    std::cout << "Boost.Locale: " << boost::locale::util::get_system_locale() << std::endl;

    namespace blc = boost::locale::conv;
    // Cyrillic small letter zhe -> \xe6 (ж on 1251, æ on 1252)
    std::string const test1251 = blc::from_utf(std::string("\xd0\xb6"), "windows-1251");
    std::cout << "1251: " << static_cast<int>(test1251.front()) << std::endl;
    // Latin small letter sharp s -> \xdf (Я on 1251, ß on 1252)
    auto const test1252 = blc::from_utf(std::string("\xc3\x9f"), "windows-1252");
    std::cout << "1252: " << static_cast<int>(test1252.front()) << std::endl;

}

Questions

Fine-print


Solution

  • The ANSI code-page APIs are legacy, so avoid them where you can.

    Windows uses UTF-16 natively, so convert from UTF-8 to UTF-16 with MultiByteToWideChar. This conversion is lossless for valid UTF-8 input.

    std::wstring getU16(const std::string &str)
    {
        if (str.empty()) return std::wstring();
        int sz = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), 0, 0);
        std::wstring res(sz, 0);
        MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), &res[0], sz);
        return res;
    }
    

    You can then use _wfsopen (from the link you provided) or, as in the example below, _wfopen_s to open the file by its UTF-16 name.

    int main()
    {
        //UTF8 source:
        std::string filename_u8;
    
        //This line works in VS2015 only
        //For older version comment out the next line, obtain UTF8 from another source
        filename_u8 = u8"c:\\test\\__ελληνικά.txt";
    
        //convert to UTF16
        std::wstring filename_utf16 = getU16(filename_u8);
    
        FILE *file = NULL;
        _wfopen_s(&file, filename_utf16.c_str(), L"w");
        if (file)
        {
            //Add BOM, optional...
    
            //Write the file name in to file, for testing...
            fwrite(filename_u8.data(), 1, filename_u8.length(), file);
    
            fclose(file);
        }
        else
        {
            std::cout << "access denied, or folder doesn't exist...\n";
        }
    
        return 0;
    }
    


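    Edit: if you really need an ANSI string, convert from UTF-8 via UTF-16 and pass the code page returned by GetACP(). Note that WideCharToMultiByte, called as below, replaces characters the code page cannot represent with a default character (usually '?').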

    std::wstring string_to_wstring(const std::string &str, int codepage)
    {
        if (str.empty()) return std::wstring();
        int sz = MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), 0, 0);
        std::wstring res(sz, 0);
        MultiByteToWideChar(codepage, 0, &str[0], (int)str.size(), &res[0], sz);
        return res;
    }
    
    std::string wstring_to_string(const std::wstring &wstr, int codepage)
    {
        if (wstr.empty()) return std::string();
        int sz = WideCharToMultiByte(codepage, 0, &wstr[0], (int)wstr.size(), 0, 0, 0, 0);
        std::string res(sz, 0);
        WideCharToMultiByte(codepage, 0, &wstr[0], (int)wstr.size(), &res[0], sz, 0, 0);
        return res;
    }
    
    std::string get_ansi_from_utf8(const std::string &utf8, int codepage)
    {
        std::wstring utf16 = string_to_wstring(utf8, CP_UTF8);
        std::string ansi = wstring_to_string(utf16, codepage);
        return ansi;
    }
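
To tie this back to the Boost.Locale approach in the question: since GetACP() reports the right code page, one option is to build the encoding name from that number and hand it to from_utf() instead of relying on get_system_locale(). The following is only a minimal sketch. The helper names encoding_name_for_codepage and active_encoding_name are made up for illustration, and the "windows-<number>" pattern is confirmed by the question's test program only for 1251 and 1252; for other code pages it is an assumption that should be checked against Boost.Locale.

```cpp
#include <cassert>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of Boost.Locale or the Windows API):
// builds an encoding name from a numeric Windows code page.
std::string encoding_name_for_codepage(unsigned codepage)
{
    // 0 is the CP_ACP placeholder, not a concrete code page number.
    assert(codepage != 0);
    // 65001 is the Windows code page identifier for UTF-8.
    if (codepage == 65001) return "UTF-8";
    // "windows-1251" and "windows-1252" are the forms used in the test
    // program above; other numbers follow the same pattern by assumption.
    return "windows-" + std::to_string(codepage);
}

// Hypothetical helper: encoding name for the active ANSI code page.
std::string active_encoding_name()
{
#ifdef _WIN32
    return encoding_name_for_codepage(GetACP());
#else
    return "UTF-8"; // assumption for non-Windows builds of this sketch
#endif
}

// Intended use (needs <boost/locale.hpp>):
//   namespace blc = boost::locale::conv;
//   std::string ansi = blc::from_utf(utf8_name, active_encoding_name(), blc::stop);
// With blc::stop, from_utf throws blc::conversion_error instead of dropping
// characters the target code page cannot represent.
```

Passing blc::stop is a deliberate choice here: for a filename, a silently altered string would name a different file, so failing loudly is probably the safer behavior.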