I'm building a command-line tool in Lua, and users may call my script with UTF-8 arguments.
Programming in Lua 4th edition says:
Several things in Lua “just work” for UTF-8 strings. Because Lua is 8-bit clean, it can read, write, and store UTF-8 strings just like other strings.
But this does not seem to hold for command-line parameters. Here is a small test.
test.lua contains:
io.write(arg[1])
I run it like this:
lua test.lua かسжГ > test.txt
and I get:
????
I get the same result with:
io.open("test.txt", "wb"):write(arg[1])
Tested with lua-5.4.8_Win32 on Windows 7 x64.
How can I solve this? Is there a workaround?
update:
This is not a duplicate of "How can I use Unicode characters on the Windows command line?". That link talks about chcp 65001, which I already tested and got the same result. chcp changes the console's code page; it does not automatically force every application launched from that CMD session to operate in full UTF-8 mode internally, the way LC_ALL does on Linux.
Many older Windows applications and even parts of the Windows API (often referred to as "ANSI" APIs) still rely on the system's default ANSI code page. If these applications don't explicitly use Unicode (UTF-16) APIs, they might still misinterpret or mangle UTF-8 data, even if the console is set to 65001.
One of the answers in that link says the same thing:
I see several answers here, but they don't seem to address the question—the user wants to get Unicode input from the command line.
Windows uses UTF-16 for encoding in two byte strings, so you need to get these from the OS in your program. There are two ways to do this—
Microsoft has an extension that allows main to take a wide character array: int wmain(int argc, wchar_t *argv[]); https://msdn.microsoft.com/en-us/library/6wd819wh.aspx
Call the Windows API to get the Unicode version of the command line: wchar_t **win_argv = (wchar_t **)CommandLineToArgvW(GetCommandLineW(), &nargs); see the CommandLineToArgvW function (shellapi.h)
Read UTF-8 Everywhere for detailed information, particularly if you are supporting other operating systems.
I even tested in Cygwin and got the same result. This is because Lua itself does not use GetCommandLineW (I searched the source code and could not find it), so no shell or console setting will solve it, however you force it. Something has to be done from inside Lua, and I am afraid the only solution is to hack lua.c or create a DLL that uses GetCommandLineW. I'm new to Lua (this is my third day) and have only basic experience with C, so I wanted to know whether there is an easier way. I searched and did not find anyone talking about this problem, so I thought the problem was in my code, but it seems to be in Lua (I hope I am wrong).
This is a known flaw on Windows. lua.exe is not a Unicode program, so Windows always passes command-line arguments to it in the system's legacy 8-bit (ANSI) code page, regardless of the console's active code page. Characters that code page cannot represent are replaced with ?.
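To illustrate where the ???? comes from, here is a minimal portable sketch of a lossy conversion to an 8-bit code page. It is not what Windows actually runs: to_8bit_lossy is a hypothetical helper, and Latin-1 stands in for the real system code page purely as an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a Windows API: convert Unicode code points to an
   8-bit code page, substituting '?' for anything the page cannot represent.
   Latin-1 (U+0000..U+00FF) stands in for the real code page as an assumption. */
static void to_8bit_lossy(const uint32_t *cps, size_t n, char *out) {
    assert(cps != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        /* Representable code points map to one byte; the rest are lost. */
        out[i] = (cps[i] <= 0xFF) ? (char)(unsigned char)cps[i] : '?';
    }
    out[n] = '\0';  /* out must have room for n + 1 bytes */
}
```

Feeding it the code points of かسжГ (U+304B, U+0633, U+0436, U+0413) yields four question marks, which matches what the script in the question receives in arg[1]. Once the substitution has happened, nothing on the Lua side can recover the original characters.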
Workaround 1 is to change the system code page to UTF-8: https://superuser.com/a/1435645/995824. Note that this is a global setting.
Workaround 2 is to read the argument from a UTF-8 encoded file instead of passing it on the command line. Here is an automated version: it creates a temp file and passes the argument through redirection (surprise: Windows preserves the encoding in redirection):
run.cmd:
@ECHO OFF
CHCP 65001 >NUL
ECHO %1 > temp.txt
lua.exe test.lua < temp.txt
DEL temp.txt
test.lua:
local txt = io.stdin:read('l')
-- note: txt will contain a trailing \r
run:
run.cmd かسжГ > test.txt
Workaround 3 is to update the main function in lua.c to use CommandLineToArgvW, and recompile Lua.
Remove the existing main function and replace it with:
// int main (int argc, char **argv) {
// The original main() is renamed to real_main(). It is static because it is
// only a helper called from the new main() below, so it stays private to lua.c.
static int real_main(int argc, char **argv) {
  int status, result;
  lua_State *L = luaL_newstate();  /* create state */
  if (L == NULL) {
    l_message(argv[0], "cannot create state: not enough memory");
    return EXIT_FAILURE;
  }
  lua_gc(L, LUA_GCSTOP);  /* stop GC while building state */
  lua_pushcfunction(L, &pmain);  /* to call 'pmain' in protected mode */
  lua_pushinteger(L, argc);  /* 1st argument */
  lua_pushlightuserdata(L, argv);  /* 2nd argument */
  status = lua_pcall(L, 2, 1, 0);  /* do the call */
  result = lua_toboolean(L, -1);  /* get result */
  report(L, status);
  lua_close(L);
  return (result && status == LUA_OK) ? EXIT_SUCCESS : EXIT_FAILURE;
}
// WIN32_LEAN_AND_MEAN must be defined before including <windows.h>. It makes
// the header skip rarely used APIs (such as Cryptography, DDE, RPC, Shell and
// Windows Sockets), which speeds up compilation and reduces macro/typedef
// pollution and name clashes. It does not change the behavior of what remains.
// Because the Shell API is skipped, <shellapi.h> is included explicitly.
#define WIN32_LEAN_AND_MEAN
#include <windows.h>   // GetCommandLineW, WideCharToMultiByte, LocalFree
#include <shellapi.h>  // CommandLineToArgvW
#include <stdlib.h>    // malloc/free, EXIT_FAILURE
// Our single entry point: always a console 'main', no subsystem tricks.
// int main(int argc_unused, char **argv_unused) {
int main(void) {
  // 1) Grab the true Unicode (UTF-16) command line from the OS.
  //    The CRT's narrow argv has already been converted to the legacy
  //    code page, so unrepresentable characters are lost before main()
  //    runs. Ask the OS directly instead.
  int wargc;
  wchar_t **wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
  if (!wargv) return EXIT_FAILURE;

  // 2) Build a parallel UTF-8 argv[] array of char*.
  //    We'll pass this to the Lua engine.
  char **argv = malloc((wargc + 1) * sizeof(char*));
  if (!argv) {
    LocalFree(wargv);
    return EXIT_FAILURE;
  }
  for (int i = 0; i < wargc; i++) {
    // Figure out how many bytes we need in UTF-8 (including the '\0').
    int need = WideCharToMultiByte(
      CP_UTF8,        // convert *to* UTF-8
      0,              // default flags
      wargv[i], -1,   // input wchar_t*, NUL-terminated
      NULL, 0,        // output buffer = NULL: return required length only
      NULL, NULL      // must be NULL for CP_UTF8
    );
    if (need <= 0) {
      argv[i] = NULL;
      continue;
    }
    // Allocate & convert
    argv[i] = malloc(need);
    if (!argv[i]) {
      // on malloc failure, clean up what we already did
      for (int j = 0; j < i; j++) free(argv[j]);
      free(argv);
      LocalFree(wargv);
      return EXIT_FAILURE;
    }
    WideCharToMultiByte(
      CP_UTF8, 0, wargv[i], -1,
      argv[i], need,
      NULL, NULL
    );
  }
  argv[wargc] = NULL;  // null-terminate the list

  // 3) Call the original Lua startup, passing our UTF-8 args.
  int result = real_main(wargc, argv);

  // 4) Free everything
  for (int i = 0; i < wargc; i++) free(argv[i]);
  free(argv);
  LocalFree(wargv);
  return result;
}
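For readers wondering what the two WideCharToMultiByte calls amount to, here is a portable sketch of a UTF-16 to UTF-8 encoder that follows the same two-pass pattern (first ask for the size, then convert). utf16_to_utf8 is a hypothetical name, uint16_t stands in for the Windows wchar_t, and replacing unpaired surrogates with U+FFFD is this sketch's own choice rather than a statement about the Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode a NUL-terminated UTF-16 string as UTF-8.
   Returns the number of bytes needed, including the terminating '\0'.
   If out is NULL only the size is computed; otherwise cap is the size of out. */
static size_t utf16_to_utf8(const uint16_t *in, char *out, size_t cap) {
    size_t n = 0;
    for (size_t i = 0; in[i] != 0; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            /* Valid surrogate pair: combine into one code point. */
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            i++;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  /* unpaired surrogate (sketch's assumption) */
        }
        unsigned char buf[4];
        size_t len;
        if (cp < 0x80) {
            buf[0] = (unsigned char)cp;
            len = 1;
        } else if (cp < 0x800) {
            buf[0] = (unsigned char)(0xC0 | (cp >> 6));
            buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 2;
        } else if (cp < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (cp >> 12));
            buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 3;
        } else {
            buf[0] = (unsigned char)(0xF0 | (cp >> 18));
            buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
            len = 4;
        }
        for (size_t k = 0; k < len; k++) {
            if (out != NULL) {
                assert(n < cap);
                out[n] = (char)buf[k];
            }
            n++;
        }
    }
    if (out != NULL) {
        assert(n < cap);
        out[n] = '\0';
    }
    return n + 1;
}
```

Calling utf16_to_utf8(w, NULL, 0) first and then again with a buffer of the returned size mirrors the need/malloc/convert sequence in the loop above. On Windows the real WideCharToMultiByte call is still the better choice; the sketch only shows that the argument bytes Lua ends up with are plain UTF-8, which Lua stores and writes unchanged.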
Then edit the Makefile inside src to add -lshell32 to the gcc command, so
$(CC) -shared -o
becomes
$(CC) -shared -lshell32 -o
Finally, run chcp 65001 before you call lua.