c++ windows utf-8 std-filesystem

Issue with std::filesystem::path conversion to std::string in C++


I'm trying to collect all filenames from a directory, but the program fails when a path contains certain characters. Below is the code snippet:

#include <filesystem>
#include <string>
#include <vector>

int main()
{
    const char* dir = "D:\\Music";
    std::vector<std::string> musicList;

    for (const auto& entry : std::filesystem::recursive_directory_iterator(dir))
    {
        if (entry.is_regular_file())
        {
            musicList.emplace_back(entry.path().string());
        }
    }
}

The issue occurs at entry.path().string() when processing strings like L"D:\\Music\\suki\\Angel Note - 月明かりは優しく・・・.mp3". The program terminates with an error pointing to:

_STD_BEGIN
// We would really love to use the proper way of building error_code by specializing
// is_error_code_enum and make_error_code for __std_win_error, but because:
//   1. We would like to keep the definition of __std_win_error in xfilesystem_abi.h
//   2. and xfilesystem_abi.h cannot include <system_error>
//   3. and specialization of is_error_code_enum and overload of make_error_code
//      need to be kept together with the enum (see limerick in N4950 [temp.expl.spec]/8)
// we resort to using this _Make_ec helper.
_NODISCARD inline error_code _Make_ec(__std_win_error _Errno) noexcept { // make an error_code
    return { static_cast<int>(_Errno), _STD system_category() };
}

[[noreturn]] inline void _Throw_system_error_from_std_win_error(const __std_win_error _Errno) {
    _THROW(system_error{ _Make_ec(_Errno) });  // The error occurs here!
}
_STD_END

I compiled the code in Visual Studio 2022, and the C++ standard is C++17.

Upon investigation, I simplified the issue with:

#include <filesystem>

int main()
{
    std::filesystem::path path = L"・";
    auto str = path.string();
}

The same error occurs at path.string(). Simplifying further, I found that the offending character "・" is U+30FB (KATAKANA MIDDLE DOT), so the literal can also be written as L"\u30FB".

While path.wstring(), path.u8string(), and the other string conversions work, I need a char* for APIs such as ImGui::Text(str) or FMOD's API. Attempts to convert the wstring to a string using codecvt, the Win32 API, or ICU all produced text that looked garbled instead of "・":

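#include <filesystem>
#include <string>
#include <Windows.h>

// Convert a UTF-16 wide string to a UTF-8 encoded std::string.
std::string ws2s(const std::wstring& wstr)
{
    if (wstr.empty()) return {};
    const int wlen = static_cast<int>(wstr.size());
    // First call: query the required size in bytes (an explicit length means no terminator is counted).
    int len = WideCharToMultiByte(CP_UTF8, 0, wstr.data(), wlen, nullptr, 0, nullptr, nullptr);
    // Size the string, not just reserve capacity: writing into reserved-only storage is undefined behavior.
    std::string str(len, '\0');
    // Second call: perform the conversion into the sized buffer.
    WideCharToMultiByte(CP_UTF8, 0, wstr.data(), wlen, str.data(), len, nullptr, nullptr);
    return str;
}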

int main()
{
    std::filesystem::path path = L"\u30FB";
    auto str = ws2s(path.wstring());
}

The resulting str looked garbled instead of showing "・" (U+30FB).

Is there a reliable method to handle this situation effectively?


Okay, I found the issue: it's the encoding the VS debugger uses to display strings. It doesn't render UTF-8 properly by default. For example, the contents of my vector<string> are valid UTF-8, but the debugger shows vec[0] as garbled text. All I need to do is append ,s8 to the expression in the watch window (vec[0],s8), which forces the debugger to interpret the content as UTF-8.
Oh Microsoft, why do you insist on UTF-16? Isn't UTF-8 good enough?
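
For what it's worth, one way to confirm that a string really holds UTF-8, independent of how the debugger renders it, is to dump its bytes. Below is a minimal sketch, assuming C++17 or later; to_utf8 and hex_dump are hypothetical helper names, not part of any library:

```cpp
#include <cstdio>
#include <filesystem>
#include <string>

// Hypothetical helper: return the path as UTF-8 bytes in a plain std::string.
// u8string() yields std::string in C++17 and std::u8string in C++20, so the
// bytes are copied through a cast that compiles under both standards.
std::string to_utf8(const std::filesystem::path& p)
{
    const auto u8 = p.u8string();
    return std::string(reinterpret_cast<const char*>(u8.data()), u8.size());
}

// Hypothetical helper: render each byte as two uppercase hex digits,
// separated by single spaces, so the encoding can be inspected directly.
std::string hex_dump(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (unsigned char c : bytes)
    {
        std::snprintf(buf, sizeof buf, "%02X", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

For a path holding only "・" (U+30FB), hex_dump(to_utf8(path)) should give E3 83 BB, which is the UTF-8 encoding of that code point. The std::string returned by to_utf8 can then be handed over via c_str() to any API that expects UTF-8 char*, assuming the API in question really does expect UTF-8.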


Solution

  • The program crashes because std::filesystem::path::string() throws an exception that your code does not catch. This is an encoding problem: the path contains characters that cannot be represented in the narrow encoding currently in effect. Add this at the beginning of your program and the issue should be resolved:

    static constexpr char localeName[] = "ja_JP.utf-8";
    
    // Instruct the C standard library that Japanese will be used with UTF-8 encoding
    std::setlocale(LC_ALL, localeName);
    
    // Instruct the C++ standard library that Japanese will be used with UTF-8 encoding, for example in std::string, std::ostream
    std::locale::global(std::locale(localeName));
    
    // Use the system locale (language and encoding) when printing data to std::cout
    // Note that if your system is using a different encoding than UTF-8, like CP932, the C++ standard library will implicitly do a conversion.
    std::cout.imbue(std::locale{""});
    

    I had a similar problem with boost::filesystem::path and this resolved the issues.

    Note that the encoding part of the locale name is what matters most. On MSVC, this should address the issue too:

    static constexpr char localeName[] = ".utf-8";

    Here is a full demo:

    #include <iostream>
    #include <filesystem>
    #include <locale>
    
    #define LOG(x) std::cout << #x " = " << x << '\n'
    
    int main()
    {
        std::locale::global(std::locale{".utf-8"});
        // use system encoding - language neutral
        std::locale sysLoc{std::locale{"C"}, "", std::locale::ctype};
        std::cout.imbue(sysLoc);
        std::cerr.imbue(sysLoc);
        
        for (const auto& dir_en : std::filesystem::directory_iterator{"."})
        {
            LOG(dir_en);
            LOG(dir_en.path());
            LOG(dir_en.path().string());
            std::cout << "---------------\n";
        }
    }
    

    I got these results:

    C:\Users\marekR22\Downloads\MyDir>dir
     Volume in drive C has no label.
     Volume Serial Number is 5608-EF1A
    
     Directory of C:\Users\marekR22\Downloads\MyDir
    
    07/16/2024  03:58 PM    <DIR>          .
    07/16/2024  03:58 PM    <DIR>          ..
    07/16/2024  03:56 PM                47 Angel Note - 月明かりは優しく・・・.txt
    07/16/2024  03:55 PM               526 main.cpp
                   2 File(s)            573 bytes
                   2 Dir(s)   7,074,545,664 bytes free
    
    C:\Users\marekR22\Downloads\MyDir>cl /std:c++20 /EHcs /O2 /D NDEBUG /utf-8 main.cpp
    Microsoft (R) C/C++ Optimizing Compiler Version 19.39.33523 for x64
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    main.cpp
    Microsoft (R) Incremental Linker Version 14.39.33523.0
    Copyright (C) Microsoft Corporation.  All rights reserved.
    
    /out:main.exe
    main.obj
    
    C:\Users\marekR22\Downloads\MyDir>chcp
    Active code page: 852
    
    C:\Users\marekR22\Downloads\MyDir>main.exe
    dir_en = ".\\Angel Note - ????????.txt"
    dir_en.path() = ".\\Angel Note - ????????.txt"
    dir_en.path().string() = .\Angel Note - ????????.txt
    ---------------
    dir_en = ".\\main.cpp"
    dir_en.path() = ".\\main.cpp"
    dir_en.path().string() = .\main.cpp
    ---------------
    dir_en = ".\\main.exe"
    dir_en.path() = ".\\main.exe"
    dir_en.path().string() = .\main.exe
    ---------------
    dir_en = ".\\main.obj"
    dir_en.path() = ".\\main.obj"
    dir_en.path().string() = .\main.obj
    ---------------
    
    C:\Users\marekR22\Downloads\MyDir>chcp 65001
    Active code page: 65001
    
    C:\Users\marekR22\Downloads\MyDir>main.exe
    dir_en = ".\\Angel Note - 月明かりは優しく・・・.txt"
    dir_en.path() = ".\\Angel Note - 月明かりは優しく・・・.txt"
    dir_en.path().string() = .\Angel Note - 月明かりは優しく・・・.txt
    ---------------
    dir_en = ".\\main.cpp"
    dir_en.path() = ".\\main.cpp"
    dir_en.path().string() = .\main.cpp
    ---------------
    dir_en = ".\\main.exe"
    dir_en.path() = ".\\main.exe"
    dir_en.path().string() = .\main.exe
    ---------------
    dir_en = ".\\main.obj"
    dir_en.path() = ".\\main.obj"
    dir_en.path().string() = .\main.obj
    ---------------
    
    C:\Users\marekR22\Downloads\MyDir>chcp 932
    Active code page: 932
    
    C:\Users\marekR22\Downloads\MyDir>main.exe
    dir_en = ".\\Angel Note - 月明かりは優しく・・・.txt"
    dir_en.path() = ".\\Angel Note - 月明かりは優しく・・・.txt"
    dir_en.path().string() = .\Angel Note - 月明かりは優しく・・・.txt
    ---------------
    dir_en = ".\\main.cpp"
    dir_en.path() = ".\\main.cpp"
    dir_en.path().string() = .\main.cpp
    ---------------
    dir_en = ".\\main.exe"
    dir_en.path() = ".\\main.exe"
    dir_en.path().string() = .\main.exe
    ---------------
    dir_en = ".\\main.obj"
    dir_en.path() = ".\\main.obj"
    dir_en.path().string() = .\main.obj
    ---------------
    

    Note that when my code page does not support Japanese characters, ? is printed instead (no crash). After I change the code page to 65001 (which represents UTF-8), the Japanese characters are printed correctly. It also works perfectly with the Japanese code page 932.
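
Since the crash itself comes from an uncaught exception, it may also be worth guarding the conversion. The following is only a sketch, assuming C++17; try_native_string is a hypothetical helper name. It catches std::system_error, which is the type thrown in the MSVC snippet from the question, and reports failure through an empty std::optional rather than terminating:

```cpp
#include <filesystem>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical helper: attempt the native narrow conversion and return
// std::nullopt instead of letting the exception escape when the path
// cannot be represented in the narrow encoding currently in effect.
std::optional<std::string> try_native_string(const std::filesystem::path& p)
{
    try
    {
        return p.string();
    }
    catch (const std::system_error&)
    {
        // Conversion failed; the caller decides whether to skip the entry,
        // log it, or fall back to p.u8string() / p.wstring().
        return std::nullopt;
    }
}
```

In the directory loop from the question, an entry for which this returns an empty optional could be skipped or logged, so that a single unrepresentable filename does not abort the whole scan. Setting a UTF-8 locale as shown above should make the failure case go away, so the guard is a safety net rather than a replacement.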