Passing UTF-8 arguments to commands in Perl on Windows


I am trying to build a template for Perl scripts so that they would do at least most of the basic things right with UTF-8 and would work equally well on Linux and Windows machines.

One thing in particular eluded me for a while: passing UTF-8 strings as arguments to system commands. As far as I can tell, there is no way to prevent the arguments from being double-encoded as UTF-8 before they reach the shell. That is, some layer ignores the fact that the command and its arguments are already properly UTF-8 encoded, takes them for Latin-1 or something of the sort, and encodes them again as UTF-8. I could not find a clean way to avoid this layer of encoding.

Take this script:

#!/usr/bin/perl

use v5.14;

use utf8;
use feature 'unicode_strings';
use feature 'fc';
use open ':std', ':encoding(UTF-8)';
use strict;
use warnings;
use warnings FATAL => 'utf8';

use constant IS_WINDOWS => $^O eq 'MSWin32';

# Set proper locale
$ENV{'LC_ALL'} = 'C.UTF-8';

# Set UTF-8 code page on Windows
if (IS_WINDOWS) {
  system("chcp 65001 > nul 2>&1");
};

# Use Win32::Unicode::Process on Windows
if (IS_WINDOWS) {
  eval {
    require Win32::Unicode::Process;
    Win32::Unicode::Process->import;
  };
  if ($@) {
    die "Could not load Win32::Unicode::Process: $@";
  };
};


# Show the empty directory
print "---\n" . `ls -1 system*` . "---\n";

my $utf = "test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽";

# Works fine on Linux but not on Windows
print "System (touch) exit code: " . system("touch system-$utf > touch-system.txt 2>&1") . "\n";
print "System (echo) exit code: " . system("echo system-$utf > echo-system.txt 2>&1") . "\n";

if (IS_WINDOWS) {
  # Works fine on Windows
  print "SystemW (touch) exit code: " . systemW("touch systemW-$utf > touch-systemW.txt 2>&1") . "\n";
  print "SystemW (echo) exit code: " . systemW("echo systemW-$utf > echo-systemW.txt 2>&1") . "\n";
};

# Show the directory with the new files
print "---\n" . `ls -1 system*` . "---\n";

exit;

On Linux, everything is fine: the file created with touch through system() has a UTF-8 encoded filename and the content of the file created with echo is correctly UTF-8 encoded.

Yet, I found no way to get the same code to behave correctly on Windows. There, the output of the script is this:

---
---
System (touch) exit code: 0
System (echo) exit code: 0
SystemW (touch) exit code: 
SystemW (echo) exit code: 
---
system-test-теÑÑ‚-מבחן-परीकà¥à¤·à¤£-😊-ð“½ð“®ð“¼ð“½
systemW-test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽
---

As the script shows, the only way I could make it work is to use Win32::Unicode::Process::systemW() to replace system(). The file systemW-test-тест-מבחן-परीक्षण-😊-𝓽𝓮𝓼𝓽 is correctly named and the content of echo-systemW.txt is correctly encoded in UTF-8.

My questions are these:

  1. Is there a way to avoid using systemW() and keep the code identical for Linux and Windows but somehow remove this layer that double-encodes the system command? In other words, is this the only good way to go?

  2. If this is the right way, I am not sure how to obtain the similarly correct behaviour for backticks. They have the same problem as system() but I have no idea how to capture the output of a command with systemW() aside from piping it into a temporary file and reading that at the end (possible, of course, but maybe not great).


Solution

  • Avoiding systemW() for Unified Behavior on Linux and Windows: Unfortunately, Windows does not handle UTF-8 command lines the way Linux shells do. chcp 65001 only sets the console code page to UTF-8, and even that comes with quirks and inconsistencies; it does not change how Perl starts processes. The double-encoding arises because Perl's system() function and backticks (`...`) on Windows internally use the ANSI versions of the Windows API. Those interpret the bytes of the command line in the system's ANSI code page rather than as UTF-8, so each byte of an already UTF-8 encoded argument is treated as a separate character and converted again.

    To achieve consistent behavior, you must use the wide-character APIs, for example through systemW() from Win32::Unicode::Process. There is no direct way around this limitation with Perl's standard system() on Windows.
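The double-encoding described above can be reproduced on any platform with the core Encode module. The following is only a sketch of the mechanism, not of what Perl does internally: it assumes the ANSI code page is Windows-1252 (common on Western European installations, but yours may differ), and it uses just the first two Cyrillic letters of the test string so that every byte has a defined Windows-1252 mapping.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The first two Cyrillic letters of the test string, written as escapes
# so the example does not depend on the source file's encoding.
my $text = "\x{0442}\x{0435}";

my $bytes   = encode('UTF-8', $text);     # the bytes the script hands to system()
my $misread = decode('cp1252', $bytes);   # ASSUMPTION: ANSI code page is Windows-1252
my $twice   = encode('UTF-8', $misread);  # the double-encoded result

printf "UTF-8 once:  %s\n", unpack('H*', $bytes);
printf "UTF-8 twice: %s\n", unpack('H*', $twice);

# Undoing the two wrong steps in reverse order recovers the original text.
my $repaired = decode('UTF-8', encode('cp1252', decode('UTF-8', $twice)));
print "Round trip restores the text\n" if $repaired eq $text;
```

The four original bytes grow to nine, which is the same kind of expansion visible in the garbled file name from the plain system() call.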

  • Handling Backticks with Wide-Character APIs: As you've identified, Perl's backticks also rely on the ANSI API, and there's no direct equivalent of systemW() for capturing output. Two workarounds are available:

    Redirect the command's output to a temporary file, run the command with systemW(), and read the file back as UTF-8, as you mentioned. Alternatively, build your own backtick-like helper on top of the wide-character APIs that Win32::Unicode::Process wraps.
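A minimal sketch of the temporary-file workaround is shown here, wrapped so that the calling code stays identical on Linux and Windows. The helper names run_utf8() and capture_utf8() are hypothetical, not part of any module. The sketch assumes the command takes a Perl character string, that the command itself writes UTF-8 (on Windows this relies on the console code page being 65001, as set by chcp in the question's script), and that the temporary file path contains no characters needing special quoting.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

use constant IS_WINDOWS => $^O eq 'MSWin32';

if (IS_WINDOWS) {
    # Same loading approach as in the question's script.
    require Win32::Unicode::Process;
    Win32::Unicode::Process->import;
}

# Hypothetical helper: run a command given as a character string.
sub run_utf8 {
    my ($command) = @_;
    return systemW($command) if IS_WINDOWS;   # wide-character API
    return system(encode('UTF-8', $command)); # explicit UTF-8 bytes elsewhere
}

# Hypothetical helper: backtick-like capture through a temporary file.
sub capture_utf8 {
    my ($command) = @_;
    my ($fh, $path) = tempfile();
    close $fh;                                # the child process writes to it
    run_utf8(qq{$command > "$path" 2>&1});
    # ASSUMPTION: the command's output is UTF-8 encoded.
    open my $in, '<:encoding(UTF-8)', $path or die "Cannot read $path: $!";
    my $output = do { local $/; <$in> };      # slurp the whole file
    close $in;
    unlink $path;
    return defined $output ? $output : '';
}
```

With these in place, a call such as capture_utf8("ls -1 system*") can replace the backtick expression in the script. Note that the exit status is not reliable on the Windows branch, since the output in the question shows systemW() returning an empty value.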