c++ utf-8

C++ argv with UTF-8 values are incorrect in the program


I'm using Windows 11. I have a program, Hello.exe:

#include <iostream>

int main(int argc, char* argv[])
{
    for (int i = 0; i < argc; i++)
    {
        std::cout << argv[i] << std::endl;
    }
}

If I pass in a Japanese UTF-8 character to this program

Hello.exe う

Then nothing is printed. Strangely, the value recorded in argv for this character is the single byte 3f, but its actual UTF-8 encoding should be e3 81 86.
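For reference, byte values like the ones quoted above can be inspected with a small helper. This is only a sketch: hex_bytes is a hypothetical name, and it simply formats each byte of a NUL-terminated string as two hex digits.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: format the bytes of a NUL-terminated string as
// space-separated, lowercase, two-digit hex values.
std::string hex_bytes(const char* s)
{
    std::string out;
    char buf[4];
    for (const unsigned char* p = reinterpret_cast<const unsigned char*>(s); *p != 0; ++p)
    {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(*p));
        if (!out.empty())
        {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Calling hex_bytes(argv[i]) inside the loop shows exactly which bytes the program received for each argument.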

What I've tried

(1) If I print this character directly from my code, its encoding in memory is correct and the character is printed to stdout.

SetConsoleOutputCP(CP_UTF8);
printf("う");

(2) I also tried using wmain instead of main, but the character still isn't printed. The value stored in argv is 46 30.

#include <iostream>

int wmain(int argc, wchar_t** argv)
{
    for (int i = 0; i < argc; i++)
    {
        std::wcout << argv[i] << std::endl;
    }
}

(3) I also wrote a Python program, which does the same thing, and the character can be printed.

What am I missing?


Solution

  • Use UTF-8 on Windows

    Windows uses UTF-16 encoded text everywhere its native API expects strings. This makes cross-platform programs harder to implement, since other operating systems typically use UTF-8 as their preferred Unicode encoding. The good news is that it is now possible to use UTF-8 in Windows applications as well.

    1. Embed UTF-8 Manifest

    Windows 10 since May 2019 (version 1903), and of course Windows 11, support a UTF-8 code page. With a manifest embedded in the .exe file, the developer can tell Windows to use the UTF-8 code page when running the application. The manifest typically looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
      <assemblyIdentity type="win32" name="..." version="6.0.0.0"/>
      <application>
        <windowsSettings>
          <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
        </windowsSettings>
      </application>
    </assembly>
    

    You can use mt.exe to embed the manifest in the executable, or add the file as a manifest in the Visual Studio project (.vcxproj).
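One way to confirm that the manifest took effect is to query the process's active code page at startup. The sketch below assumes the Win32 function GetACP and the CP_UTF8 constant (65001); process_code_page_is_utf8 is a hypothetical helper name, and the non-Windows branch is only there so the snippet builds elsewhere.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of the original answer): reports whether the
// process's active code page is UTF-8, which is what the manifest requests.
bool process_code_page_is_utf8()
{
#ifdef _WIN32
    return GetACP() == CP_UTF8;  // CP_UTF8 is 65001
#else
    return true;  // Assumption: non-Windows targets already pass UTF-8 in argv.
#endif
}
```

If this returns false on Windows, the manifest is probably not embedded, and argv in main will still be converted to the legacy code page.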

    2. Compile with /utf-8

    The Microsoft compiler (MSVC) needs the /utf-8 flag to know that the source files are encoded in UTF-8 and that string literals should be stored as UTF-8 in the executable. Don't forget that flag in your projects.
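A quick sanity check for this step is to compare a non-ASCII literal against its expected UTF-8 bytes. This is a sketch that assumes the source file itself is saved as UTF-8; narrow_literals_are_utf8 is a hypothetical name.

```cpp
#include <cstring>

// Hypothetical check: true when the narrow literal "う" was compiled to the
// three UTF-8 bytes e3 81 86 (plus the terminating NUL).
bool narrow_literals_are_utf8()
{
    const char literal[] = "う";
    const char expected[] = "\xE3\x81\x86";
    return sizeof literal == sizeof expected
        && std::memcmp(literal, expected, sizeof expected) == 0;
}
```

If the check fails under MSVC, the literal was most likely converted to the legacy code page, which usually means /utf-8 is missing from the build.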

    3. Configure the console as UTF-8

    For Windows console applications only, call SetConsoleOutputCP(CP_UTF8) at the start of the main function. Curiously, this is required even with the manifest, because the console defaults to the code page of the Windows locale rather than UTF-8.
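The console setup can be wrapped in a small function, sketched below. enable_utf8_console is a hypothetical name, and the non-Windows branch is an assumption added so the same source builds on other platforms.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: switch the attached console's output code page to
// UTF-8. Returns false if Windows rejects the call.
bool enable_utf8_console()
{
#ifdef _WIN32
    return SetConsoleOutputCP(CP_UTF8) != 0;
#else
    return true;  // Assumption: other platforms need no console setup here.
#endif
}
```

Call it as the first statement of main, before the loop that prints argv.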

    4. Always use the ANSI Windows API

    Since you enabled UTF-8 as the application code page, you don't need to call the Unicode versions of the Win32 API, whose function names end in W, like MessageBoxW. Instead, call the ANSI versions ending in A, like MessageBoxA, and pass them UTF-8 text. Don't define either the _UNICODE or the UNICODE macro, as those would select the W variants of the API, even though you want to support Unicode through UTF-8. You can then also use the Unicode-agnostic macros like MessageBox, which will properly select MessageBoxA.

    Unfortunately, a few Windows APIs exist only in a UTF-16 (wchar_t*) version. For those, you need to convert your UTF-8 string to UTF-16 manually, for example with codecvt (deprecated since C++17) or with the Win32 function MultiByteToWideChar.
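To illustrate what such a conversion does, here is a minimal, portable sketch of a UTF-8 to UTF-16 decoder. utf8_to_utf16 is a hypothetical helper, not a library function, and it does not reject overlong encodings or encoded surrogates; on Windows, MultiByteToWideChar with CP_UTF8 is the usual way to do the same job.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Sketch only: overlong forms and encoded surrogates are not rejected.
std::u16string utf8_to_utf16(const std::string& in)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected

        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3Fu);
        }
        i += extra + 1;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of range");

        if (cp >= 0x10000)
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
        else
        {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

For the character from the question, the bytes e3 81 86 decode to the single code unit 0x3046, which is stored in memory as 46 30 on a little-endian machine. That matches the value the wmain version received.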