c++ utf-8

C++ argv with UTF-8 values are incorrect in the program


I'm using Windows 11. I have a program, Hello.exe:

#include <iostream>

int main(int argc, char* argv[])
{
    for (int i = 0; i < argc; i++)
    {
        std::cout << argv[i] << std::endl;
    }
}

If I pass in a Japanese UTF-8 character to this program

Hello.exe う

Then nothing is printed. Strangely, the byte recorded in argv for this character is 3f, whereas its actual UTF-8 encoding should be e3 81 86.
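For anyone who wants to reproduce the byte values quoted above, here is a minimal sketch of a hex dump helper. The name to_hex is made up for illustration. Note that 3f is the ASCII code of '?', which is consistent with the argument having been replaced by a substitution character before it reached argv.

```cpp
#include <cstdio>
#include <string>

// Format every byte of `s` as two lowercase hex digits, separated by spaces.
// to_hex is a hypothetical helper name, not part of any library.
std::string to_hex(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char byte : s)
    {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(byte));
        if (!out.empty())
        {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Inside the loop of the program above, `std::cout << to_hex(argv[i]) << std::endl;` prints the raw bytes of each argument regardless of whether the console can display them.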

What I've tried

(1) If I print this character directly from my code, the encoding in memory is correct and the character is printed to stdout.

SetConsoleOutputCP(CP_UTF8);
printf("う");

(2) I also tried using wmain instead of main, but the character still can't be printed. The value stored in argv is 46 30.

#include <iostream>

int wmain(int argc, wchar_t** argv)
{
    for (int i = 0; i < argc; i++)
    {
        std::wcout << argv[i] << std::endl;
    }
}
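As a side note on the 46 30 value: it matches the UTF-16LE encoding of U+3046, the same character whose UTF-8 form is e3 81 86. That suggests the argument reaches wmain intact and that the remaining problem is on the output side. The sketch below shows the arithmetic. Both function names are hypothetical, and the decoder only handles a single three-byte sequence without checking for overlong forms or surrogates.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode exactly one three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).
// Returns 0 if the input does not have that shape.
std::uint32_t decode_utf8_3byte(const std::string& s)
{
    if (s.size() != 3)
    {
        return 0;
    }
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    const unsigned char b1 = static_cast<unsigned char>(s[1]);
    const unsigned char b2 = static_cast<unsigned char>(s[2]);
    if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80)
    {
        return 0;
    }
    // 4 payload bits from the lead byte, 6 from each continuation byte.
    return ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
}

// Bytes of a code point below U+10000 in UTF-16LE: low byte first.
std::vector<unsigned char> utf16le_bytes(std::uint32_t code_point)
{
    return { static_cast<unsigned char>(code_point & 0xFF),
             static_cast<unsigned char>((code_point >> 8) & 0xFF) };
}
```

Feeding the bytes e3 81 86 to decode_utf8_3byte gives 0x3046, and utf16le_bytes(0x3046) gives the two bytes 46 30 seen in the wide argv.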

(3) I also wrote a Python program, which does the same thing, and the character can be printed.

What am I missing?


Solution

  • Use UTF-8 on Windows

    Windows uses UTF-16 encoded text wherever it expects strings. This makes cross-platform programs harder to write, since other operating systems typically use UTF-8 as their preferred Unicode encoding. The good news is that it is now possible to use UTF-8 in Windows applications as well.

    1. Embed UTF-8 Manifest

    Windows 10 since May 2019 (version 1903), and of course Windows 11, support UTF-8 as the process code page. With a manifest embedded in the .exe file, the developer can tell Windows to use the UTF-8 code page when running the application. The manifest typically looks like this:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
      <assemblyIdentity type="win32" name="..." version="6.0.0.0"/>
      <application>
        <windowsSettings>
          <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
        </windowsSettings>
      </application>
    </assembly>
    

    Use mt.exe to add the manifest to the executable, or add the file as a manifest in the project file (.vcxproj) in Visual Studio.
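To check that the manifest actually took effect, one option is to query the active code page at startup. This is a minimal sketch: GetACP and CP_UTF8 are the real Windows API names, while process_code_page_is_utf8 is a name made up here, and the non-Windows branch is only an assumption so that the snippet compiles elsewhere.

```cpp
#ifdef _WIN32
#include <Windows.h>
#endif

// Returns true when narrow (char*) strings given to the process, such as
// argv, are expected to be UTF-8 encoded.
bool process_code_page_is_utf8()
{
#ifdef _WIN32
    // With the manifest embedded, GetACP() is expected to report 65001 (CP_UTF8).
    return GetACP() == CP_UTF8;
#else
    // Assumption for this sketch: other platforms pass argv through as raw bytes.
    return true;
#endif
}
```

If this returns false on Windows, the manifest is probably missing from the executable, or the system is older than version 1903.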

    2. Compile with /utf-8

    The Microsoft compiler (MSVC) needs the /utf-8 flag, which tells it that the source files are encoded in UTF-8 and that string literals should be emitted as UTF-8 too (it sets both the source and the execution character set). Don't forget that flag in your projects.
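A forgotten flag can be caught at compile time by probing the bytes of a known literal. The sketch below assumes the source file itself is saved as UTF-8; kProbe and literals_are_utf8 are names invented for this example.

```cpp
// "う" is U+3046. Encoded as UTF-8 it is the three bytes E3 81 86, so the
// array holds four chars including the terminating NUL.
constexpr char kProbe[] = "う";

constexpr bool literals_are_utf8()
{
    return sizeof(kProbe) == 4
        && static_cast<unsigned char>(kProbe[0]) == 0xE3
        && static_cast<unsigned char>(kProbe[1]) == 0x81
        && static_cast<unsigned char>(kProbe[2]) == 0x86;
}

// Fails the build when narrow string literals are not emitted as UTF-8.
static_assert(literals_are_utf8(),
              "narrow string literals are not UTF-8; with MSVC, compile with /utf-8");
```

Placing this in a commonly included header makes a misconfigured project fail early instead of producing garbled text at run time.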

    3. Configure the console as UTF-8

    For Windows console applications, call SetConsoleOutputCP(CP_UTF8); for output and SetConsoleCP(CP_UTF8); for input at the start of the main function. Curiously, this is required even with the manifest, because the console defaults to the Windows OEM code page and not to UTF-8.

    BUG: from my experiments, it seems that on Windows 10, reading UTF-8 input from the console does not work whatever you try, unless you call ReadConsoleW yourself and convert the result. On Windows 11, however, it works.
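One possible way to package the two calls is a small scope guard that also puts the previous console code pages back on exit, since the change may otherwise stay in effect for the console window after the program ends. This is only a sketch: ConsoleUtf8Scope is a hypothetical class name, the Windows calls are the documented console functions, and on other platforms the class deliberately does nothing.

```cpp
#ifdef _WIN32
#include <Windows.h>
#endif

// Switches the console to UTF-8 for the lifetime of the object and restores
// the previous input and output code pages in the destructor.
class ConsoleUtf8Scope
{
public:
    ConsoleUtf8Scope()
    {
#ifdef _WIN32
        old_output_cp_ = GetConsoleOutputCP();
        old_input_cp_ = GetConsoleCP();
        ok_ = SetConsoleOutputCP(CP_UTF8) != 0 && SetConsoleCP(CP_UTF8) != 0;
#endif
    }

    ~ConsoleUtf8Scope()
    {
#ifdef _WIN32
        // A saved value of 0 is treated here as "nothing to restore".
        if (old_output_cp_ != 0) SetConsoleOutputCP(old_output_cp_);
        if (old_input_cp_ != 0) SetConsoleCP(old_input_cp_);
#endif
    }

    ConsoleUtf8Scope(const ConsoleUtf8Scope&) = delete;
    ConsoleUtf8Scope& operator=(const ConsoleUtf8Scope&) = delete;

    // False when one of the Set* calls failed, e.g. with no console attached.
    bool ok() const { return ok_; }

private:
    bool ok_ = true;
#ifdef _WIN32
    UINT old_output_cp_ = 0;
    UINT old_input_cp_ = 0;
#endif
};
```

Declaring `ConsoleUtf8Scope console;` as the first statement of main has the same effect as the two calls, with the restore happening when main returns.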

    4. Always use the ANSI Windows API

    Windows API functions exist in two flavors. There are functions ending in A (for ANSI) that expect const char* zero-terminated strings, and there are those ending in W (for wide) that expect const wchar_t* zero-terminated strings. The type wchar_t is 16-bit wide on Windows, and the wide strings are expected to be UTF-16LE encoded.

    Since you enabled UTF-8 as the application code page, you want to use the A (ANSI) functions, not the W (wide) API. So, although you do want to support Unicode, define neither the _UNICODE nor the UNICODE macro, as those would select the W variants of the API. Alternatively, in Visual Studio, select Use Multi-Byte Character Set for the Character Set parameter (in the Advanced configuration properties).

    You can then also use the encoding-agnostic macros such as MessageBox, which will correctly resolve to MessageBoxA.

    Unfortunately, a few Windows API functions exist only in a UTF-16 (wchar_t*) version. For those, you need to convert your UTF-8 string to UTF-16 manually, for example with std::codecvt or MultiByteToWideChar.
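For the W-only functions, the conversion with MultiByteToWideChar could look roughly like this. It is a sketch under the assumption that the input is valid UTF-8; utf8_to_utf16 is a helper name chosen here, and the whole block is Windows-only.

```cpp
#ifdef _WIN32
#include <Windows.h>
#include <stdexcept>
#include <string>

// Convert a UTF-8 std::string to a UTF-16 std::wstring.
// Throws std::runtime_error when the input is not valid UTF-8.
std::wstring utf8_to_utf16(const std::string& utf8)
{
    if (utf8.empty())
    {
        return std::wstring();
    }
    const int in_len = static_cast<int>(utf8.size());

    // First call: ask how many wchar_t units the result needs.
    const int out_len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                            utf8.data(), in_len, nullptr, 0);
    if (out_len <= 0)
    {
        throw std::runtime_error("utf8_to_utf16: invalid UTF-8 input");
    }

    // Second call: perform the conversion into the sized buffer.
    std::wstring wide(static_cast<std::size_t>(out_len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), in_len, &wide[0], out_len);
    return wide;
}
#endif
```

The returned string's c_str() can then be passed wherever a const wchar_t* argument is required.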

    Example

    Here is a Hello World demonstration

    Hello-UTF-8.cpp: must be stored with UTF-8 encoding. BOM is permitted, but not recommended.

    #define _CRT_SECURE_NO_WARNINGS
    #include <Windows.h>
    #include <iostream>
    #include <string>
    #include <cstdio>
    
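    int main(int argc, char* argv[])
    {
        // Switch the console to UTF-8 for both output and input.
        SetConsoleOutputCP(CP_UTF8);
        SetConsoleCP(CP_UTF8);

        // Collect a UTF-8 literal and all arguments, one per line.
        std::string str = "議論\n";
        for (int i = 0; i < argc; i++)
        {
            str += argv[i];
            str += "\n";
        }
        std::cout << str;

        // The file name is a UTF-8 literal; skip the write if the file cannot be opened.
        FILE* file = fopen("Деякий файл.txt", "wt");
        if (file != nullptr)
        {
            fputs(str.c_str(), file);
            fclose(file);
        }

        // UNICODE is not defined, so MessageBox resolves to MessageBoxA.
        MessageBox(nullptr, str.c_str(), "Γεια σου κόσμε", MB_OK);
    }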
    

    utf8.manifest: exactly as above (the name attribute is only a placeholder):

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
      <assemblyIdentity type="win32" name="..." version="6.0.0.0"/>
      <application>
        <windowsSettings>
          <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
        </windowsSettings>
      </application>
    </assembly>
    

    Compiled and run on PowerShell (for proper Unicode handling):

    PS E:\Привет> cl Hello-UTF-8.cpp /utf-8 /nologo User32.lib /EHsc
    Hello-UTF-8.cpp
    
    PS E:\Привет> mt -nologo -manifest utf8.manifest -outputresource:Hello-UTF-8.exe;#1
    
    PS E:\Привет> .\Hello-UTF-8.exe こんにちは κόσμος
    議論
    E:\Привет\Hello-UTF-8.exe
    こんにちは
    κόσμος
    
    PS E:\Привет> dir *.txt
    
        Répertoire : E:\Привет
    
    Mode                 LastWriteTime         Length Name
    ----                 -------------         ------ ----
    -a----        20.06.2025     11:11             72 Деякий файл.txt