I am trying to build a template for Perl scripts so that they would do at least most of the basic things right with UTF-8 and would work equally well on Linux and Windows machines.
One thing in particular escaped me for a while: the difficulty of passing UTF-8 strings as arguments to system commands. It seems to me that there is no way to prevent arguments from being UTF-8 encoded twice before they reach the shell. As I understand it, some layer ignores the fact that the command and its arguments are already properly UTF-8 encoded, takes them for Latin-1 or something of the sort, and encodes them again as UTF-8. I could not find a way to cleanly avoid this layer of encoding.
Take this script:
#!/usr/bin/perl
use v5.14;
use utf8;
use feature 'unicode_strings';
use feature 'fc';
use open ':std', ':encoding(UTF-8)';
use strict;
use warnings;
use warnings FATAL => 'utf8';
use constant IS_WINDOWS => $^O eq 'MSWin32';
# Set proper locale
$ENV{'LC_ALL'} = 'C.UTF-8';
# Set UTF-8 code page on Windows
if (IS_WINDOWS) {
    system("chcp 65001 > nul 2>&1");
}
# Use Win32::Unicode::Process on Windows
if (IS_WINDOWS) {
    eval {
        require Win32::Unicode::Process;
        Win32::Unicode::Process->import;
    };
    if ($@) {
        die "Could not load Win32::Unicode::Process: $@";
    }
}
# Show the empty directory
print "---\n" . `ls -1 system*` . "---\n";
my $utf = "test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽";
# Works fine on Linux but not on Windows
print "System (touch) exit code: " . system("touch system-$utf > touch-system.txt 2>&1") . "\n";
print "System (echo) exit code: " . system("echo system-$utf > echo-system.txt 2>&1") . "\n";
if (IS_WINDOWS) {
    # Works fine on Windows
    print "SystemW (touch) exit code: " . systemW("touch systemW-$utf > touch-systemW.txt 2>&1") . "\n";
    print "SystemW (echo) exit code: " . systemW("echo systemW-$utf > echo-systemW.txt 2>&1") . "\n";
}
# Show the directory with the new files
print "---\n" . `ls -1 system*` . "---\n";
exit;
On Linux, everything is fine: the file created with touch through system() has a UTF-8 encoded filename, and the content of the file created with echo is correctly UTF-8 encoded.
Yet, I found no way to get the same code to behave correctly on Windows. There, the output of the script is this:
---
---
System (touch) exit code: 0
System (echo) exit code: 0
SystemW (touch) exit code:
SystemW (echo) exit code:
---
system-test-теÑÑ‚-מבחן-परीकà¥à¤·à¤£-😊-ð“½ð“®ð“¼ð“½
systemW-test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽
---
As the script shows, the only way I could make it work is to use Win32::Unicode::Process::systemW() to replace system(). The file systemW-test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽 is correctly named and the content of echo-systemW.txt is correctly encoded in UTF-8.
My questions are these:
Is there a way to avoid using systemW() and keep the code identical for Linux and Windows but somehow remove this layer that double-encodes the system command? In other words, is this the only good way to go?
If this is the right way, I am not sure how to obtain the similarly correct behaviour for backticks. They have the same problem as system(), but I have no idea how to capture the output of a command with systemW() aside from piping it into a temporary file and reading that at the end (possible, of course, but maybe not great).
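Avoiding systemW() for unified behavior on Linux and Windows: Unfortunately, Windows' cmd.exe does not handle UTF-8 the way Linux shells do. chcp 65001 only sets the console code page to UTF-8, and even that comes with quirks and inconsistencies. The double encoding arises because Perl's system() function and backticks on Windows internally call the ANSI ("A") variants of the Windows APIs, which interpret the bytes of the command line according to the system's ANSI code page rather than as UTF-8. The UTF-8 bytes of your string are therefore read as if they were single-byte characters (for example Windows-1252), which is what produces the garbled file name in your output.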
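The misreading described above can be reproduced on any platform with the core Encode module. The following is only an illustrative sketch: misread_as_ansi() is a hypothetical helper name, and Windows-1252 (cp1252) is an assumption about the ANSI code page, chosen because it is consistent with the garbled name in the question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: show how a string looks after its UTF-8 bytes are
# misread in a single-byte code page (cp1252 is an assumption here).
sub misread_as_ansi {
    my ($text, $ansi_code_page) = @_;
    $ansi_code_page //= 'cp1252';
    my $utf8_bytes = encode('UTF-8', $text);       # the bytes handed to the API
    return decode($ansi_code_page, $utf8_bytes);   # how the ANSI layer reads them
}

binmode STDOUT, ':encoding(UTF-8)';
# U+00E9 (é) is the two bytes C3 A9 in UTF-8, which cp1252 reads as "Ã©".
print misread_as_ansi("\x{E9}"), "\n";
```

Writing that misread string back out as UTF-8 gives the "double-encoded" result: one original character has become two.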
To achieve consistent behavior, you must use the wide-character APIs, such as systemW() from Win32::Unicode::Process. There's no direct way around this limitation with Perl's standard system() on Windows.
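One way to keep the calling code identical on both platforms is to hide the difference behind a small wrapper. This is a minimal sketch, not a definitive implementation: run_command() is a hypothetical name, it assumes Win32::Unicode::Process is installed on Windows, and the explicit encode() on other platforms is an addition that is not in your script.

```perl
use strict;
use warnings;
use Encode qw(encode);

use constant IS_WINDOWS => $^O eq 'MSWin32';

# Hypothetical wrapper: takes a command as a Perl character string and
# picks the platform-appropriate way to run it.
sub run_command {
    my ($command) = @_;

    if (IS_WINDOWS) {
        # Assumption: Win32::Unicode::Process is installed. It is loaded
        # lazily so the script still compiles on other platforms.
        require Win32::Unicode::Process;
        return Win32::Unicode::Process::systemW($command);
    }

    # Elsewhere, pass explicit UTF-8 bytes instead of relying on Perl's
    # internal string representation.
    return system(encode('UTF-8', $command));
}
```

The rest of the script would then call run_command("touch system-$utf") everywhere. The output in the question shows systemW() printing an empty exit code, so its return value should be checked before it is compared with what system() returns.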
Handling backticks with wide-character APIs: As you've identified, Perl's backticks also rely on the ANSI API, and there's no direct equivalent of systemW() for capturing output. However, there are two workarounds: use temporary files for the command output, as you mentioned, or build a custom backtick-like helper on top of Win32::Unicode::Process and the wide-character APIs.
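The temporary-file workaround can be wrapped in a helper so that it reads like backticks at the call site. The sketch below makes several assumptions: capture_output() is a hypothetical name, Win32::Unicode::Process is installed on Windows, the child process writes UTF-8, and the temporary path is safe inside double quotes for both cmd.exe and a POSIX shell.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

use constant IS_WINDOWS => $^O eq 'MSWin32';

# Hypothetical backtick replacement: run $command through the shell, send its
# output to a temporary file, and return that output decoded from UTF-8.
sub capture_output {
    my ($command) = @_;

    # Reserve a temporary file name; the file is removed at program exit.
    my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # Assumption: the temp path needs no quoting beyond double quotes.
    my $redirected = qq{$command > "$tmp_path" 2>&1};

    if (IS_WINDOWS) {
        require Win32::Unicode::Process;
        Win32::Unicode::Process::systemW($redirected);
    }
    else {
        system(encode('UTF-8', $redirected));
    }

    # Read the whole file back as a character string.
    open my $in, '<:encoding(UTF-8)', $tmp_path
        or die "Cannot read $tmp_path: $!";
    my $output = do { local $/; <$in> };
    close $in;
    return $output // '';
}
```

With this in place, `ls -1 system*` in the script could become capture_output('ls -1 system*'). Note that stderr is merged into the captured text here, which differs from plain backticks.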