I'm using Windows 11. I have a program, "Hello.exe":
#include <iostream>

int main(int argc, char* argv[])
{
    // Print every command-line argument on its own line.
    for (int i = 0; i < argc; i++)
    {
        std::cout << argv[i] << std::endl;
    }
}
If I pass a Japanese character, encoded as UTF-8, to this program:
Hello.exe う
then nothing is printed. Strangely, the content of this character as recorded in argv is 3f, but its actual UTF-8 encoding should be e3 81 86.
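The byte values quoted above can be checked with a small hex dump. This is only a sketch: `to_hex` is a hypothetical helper name, not part of the original program, and it simply formats whatever bytes it is given.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats each byte of a string as two lowercase
// hex digits, separated by spaces.
std::string to_hex(const std::string& bytes)
{
    std::string out;
    char buf[4];
    for (unsigned char b : bytes)
    {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(b));
        if (!out.empty())
        {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

Calling `to_hex(argv[1])` inside the loop shows which bytes actually arrived in argv.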
What I've tried
(1) If I print this character directly from my code, its encoding in memory is correct and the character is printed to stdout:
SetConsoleOutputCP(CP_UTF8);
printf("う");
(2) I also tried using wmain instead of main, but the character can't be printed either. The value stored in argv is 46 30:
#include <iostream>

int wmain(int argc, wchar_t** argv)
{
    // Same loop as before, but with wide (UTF-16) arguments.
    for (int i = 0; i < argc; i++)
    {
        std::wcout << argv[i] << std::endl;
    }
}
(3) I also wrote a Python program, which does the same thing, and the character can be printed.
What am I missing?
Windows uses UTF-16 encoded text wherever it expects strings. This makes cross-platform programs harder to write, since other operating systems typically use UTF-8 as their preferred Unicode encoding. The good news is that it is now possible to use UTF-8 in Windows applications as well.
Windows 10 since version 1903 (May 2019), and of course Windows 11, support UTF-8 as a process code page. With a manifest embedded in the .exe file, the developer can tell Windows to use the UTF-8 code page when running the application.
A typical manifest file looks like this:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
<assemblyIdentity type="win32" name="..." version="6.0.0.0"/>
<application>
<windowsSettings>
<activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
</windowsSettings>
</application>
</assembly>
Use mt.exe to add the manifest to the executable, or add the file as a manifest in the project file (.vcxproj) in Visual Studio.
The Microsoft compiler (MSVC) needs the /utf-8 flag to let it know that the source files are encoded in UTF-8 and that string literals should be emitted as UTF-8. Don't forget that flag in your projects.
For Windows console applications only, call SetConsoleOutputCP(CP_UTF8) at the start of the main function. Curiously, this is required even with the manifest, as the console defaults to the Windows locale code page and not UTF-8.
Since you enabled UTF-8 as the application code page, you don't need to call the Unicode versions of the Win32 API, whose function names end in W, like MessageBoxW. Instead, call the ANSI version ending in A, like MessageBoxA, and pass it UTF-8 text. Define neither the _UNICODE nor the UNICODE macro, as those would select the W variant of the API, even though you want to support Unicode through UTF-8. You can then also use the encoding-agnostic macros like MessageBox, which will properly select MessageBoxA.
Unfortunately, a few Windows APIs exist only in a UTF-16 (wchar_t*) version. For those, you will need to manually convert your UTF-8 string to UTF-16, for example with codecvt.
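To illustrate what such a conversion does, here is a minimal hand-written sketch. It is not the codecvt approach mentioned above (the std::codecvt-based conversion facilities have been deprecated since C++17), and on Windows one would more likely call MultiByteToWideChar with CP_UTF8. The name `utf8_to_utf16` is hypothetical, and the function returns std::u16string, whose code units match the 16-bit wchar_t used on Windows.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Throws std::invalid_argument on malformed input.
std::u16string utf8_to_utf16(const std::string& in)
{
    // Smallest code point allowed for each sequence length (rejects overlong forms).
    static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};

    std::u16string out;
    std::size_t i = 0;
    while (i < in.size())
    {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        char32_t cp = 0;
        std::size_t len = 0;
        if (lead < 0x80)                { cp = lead;        len = 1; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;

        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        if (cp < 0x10000)
        {
            out.push_back(static_cast<char16_t>(cp));
        }
        else
        {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For the character from the question, the bytes e3 81 86 decode to the single UTF-16 code unit 0x3046, which is stored in little-endian memory as 46 30, the value seen with wmain.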