c++ unicode utf-8 visual-studio-2017 c++-experimental

UTF-8 support in Visual Studio 2017 std::experimental::filesystem::path


I was happy to see support for std::experimental::filesystem added in Visual Studio 2017, but I have just run into issues with Unicode. I blindly assumed that I could use UTF-8 strings everywhere, but that fails: when constructing a std::experimental::filesystem::path from a char* pointing to a UTF-8 encoded string, no UTF-8 conversion happens (even though the headers use _To_wide and _To_byte functions internally). I wrote a simple test example:

#include <cstdio>   // printf
#include <cstring>  // strlen
#include <cwchar>   // wcslen
#include <string>
#include <experimental/filesystem>

#define WIN32_LEAN_AND_MEAN
#include <Windows.h>

static inline std::string FromUtf16(const wchar_t* pUtf16String)
{
    int nUtf16StringLength = static_cast<int>(wcslen(pUtf16String));
    int nUtf8StringLength = ::WideCharToMultiByte(CP_UTF8, 0, pUtf16String, nUtf16StringLength, NULL, 0, NULL, NULL);
    std::string sUtf8String(nUtf8StringLength, '\0');
    // Write into the string's own buffer; &s[0] avoids casting away const from c_str().
    nUtf8StringLength = ::WideCharToMultiByte(CP_UTF8, 0, pUtf16String, nUtf16StringLength, &sUtf8String[0], nUtf8StringLength, NULL, NULL);
    return sUtf8String;
}

static inline std::string FromUtf16(const std::wstring& sUtf16String)
{
    return FromUtf16(sUtf16String.c_str());
}

static inline std::wstring ToUtf16(const char* pUtf8String)
{
    int nUtf8StringLength = static_cast<int>(strlen(pUtf8String));
    // First call: a null buffer and a size of 0 ask for the required length in wide characters.
    int nUtf16StringLength = ::MultiByteToWideChar(CP_UTF8, 0, pUtf8String, nUtf8StringLength, NULL, 0);
    std::wstring sUtf16String(nUtf16StringLength, L'\0');
    nUtf16StringLength = ::MultiByteToWideChar(CP_UTF8, 0, pUtf8String, nUtf8StringLength, &sUtf16String[0], nUtf16StringLength);
    return sUtf16String;
}

static inline std::wstring ToUtf16(const std::string& sUtf8String)
{
    return ToUtf16(sUtf8String.c_str());
}

int main(int argc, char** argv)
{
    std::string sTest(u8"Kaķis");
    std::wstring sWideTest(ToUtf16(sTest));
    wchar_t pWideTest[1024] = {};
    char pByteTest[1024];
    std::experimental::filesystem::path Path1(sTest), Path2(sWideTest);
    std::experimental::filesystem::v1::_To_wide(sTest.c_str(), pWideTest);
    bool bWideEqual = sWideTest == pWideTest;
    std::experimental::filesystem::v1::_To_byte(pWideTest, pByteTest);
    bool bUtf8Equal = sTest == pByteTest;
    bool bPathsEqual = Path1 == Path2;
    printf("wide equal: %d, utf-8 equal: %d, paths equal: %d\n", bWideEqual, bUtf8Equal, bPathsEqual);
}

But as I stated earlier, I just blindly assumed that UTF-8 would work. Looking at std::experimental::filesystem::path on cppreference.com, the constructor section actually states that:

  • If the source character type is char, the encoding of the source is assumed to be the native narrow encoding (so no conversion takes place on POSIX systems)
  • If the source character type is char16_t, conversion from UTF-16 to native filesystem encoding is used.
  • If the source character type is char32_t, conversion from UTF-32 to native filesystem encoding is used.
  • If the source character type is wchar_t, the input is assumed to be the native wide encoding (so no conversion takes place on Windows)

I am not sure how to interpret the first item. First, it only says something about POSIX systems (and I do not understand what the native narrow encoding is; does that mean UTF-8 will not work on POSIX either?). Second, it does not say anything about Windows, and MSDN is silent on this as well. So, how do I properly handle initialization of std::experimental::filesystem::path from Unicode characters in a cross-platform safe manner?


Solution

  • The "narrow" (8-bit) encoding of filesystem::path depends on the environment and host OS. It is UTF-8 on many POSIX systems, but that is not guaranteed, and on Windows it is typically the active code page rather than UTF-8. If you want UTF-8, ask for it explicitly: build paths with std::filesystem::u8path() and read them back with std::filesystem::path::u8string().
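
As a minimal sketch of that advice, the two helpers below wrap the explicit UTF-8 entry points. The names PathFromUtf8 and PathToUtf8 are made up for illustration and are not part of any library. The sketch assumes a C++17 standard library with <filesystem>; with the Visual Studio 2017 experimental header, the same calls should be available under std::experimental::filesystem, but treat that as an assumption to verify against your toolset.

```cpp
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper: build a path from bytes that are known to be UTF-8,
// independent of the platform's native narrow encoding.
// Note: fs::u8path is deprecated in C++20, where fs::path can be
// constructed from char8_t sources directly.
inline fs::path PathFromUtf8(const std::string& sUtf8)
{
    return fs::u8path(sUtf8);
}

// Hypothetical helper: return the path as UTF-8 bytes.
// u8string() returns std::string in C++17 and std::u8string in C++20,
// so the result is copied through iterators to compile under both.
inline std::string PathToUtf8(const fs::path& Path)
{
    const auto sUtf8 = Path.u8string();
    return std::string(sUtf8.begin(), sUtf8.end());
}
```

With these, code that holds UTF-8 in std::string calls PathFromUtf8(sTest) instead of the plain path(sTest) constructor, and calls PathToUtf8 whenever a path has to go back into UTF-8 storage or logs. On Windows the conversion to and from the native wide encoding then happens inside the library, so the hand-written ToUtf16/FromUtf16 wrappers are no longer needed for paths.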