perl internationalization cmd activestate

How can I obtain correct non-ASCII command-line arguments in ActiveState Perl?


Running the following command

perl -e "for (my $i = 0; $i < length($ARGV[0]); $i++) {print ord(substr($ARGV[0], $i, 1)), qq{\n}; }" αβγδεζ

in a Windows 7 cmd window with ActiveState Perl v5.14.2 produces the following result:

97
223
63
100
101
63

The above values are nonsensical and don't correspond to any known encoding, so trying to decode them with the approach recommended in "How can I treat command-line arguments as UTF-8 in Perl?" doesn't help. Changing the command window's active code page doesn't change the results.


Solution

  • Your system, like every Windows system I know, uses the 1252 ANSI code page by default, so you could try

    use Encode qw( decode );
    @ARGV = map { decode('cp1252', $_) } @ARGV;
    

    Note that cp1252 cannot represent those characters, which is why the console, and thus Perl, actually receives substitutes: aß?de? (the values 97, 223, 63, 100, 101, 63 from the question).
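
    As a minimal sketch of what the values in the question mean, the following decodes them as cp1252 and lists the resulting code points. It assumes a terminal that displays UTF-8 output; the variable names are only illustrative.

    ```perl
    use strict;
    use warnings;
    use Encode qw( decode );

    # Byte values printed by the one-liner in the question.
    my @observed = (97, 223, 63, 100, 101, 63);

    # Pack them into a byte string and decode it as cp1252.
    my $decoded = decode('cp1252', pack('C*', @observed));

    # Assumption: the terminal displays UTF-8.
    binmode(STDOUT, ':encoding(UTF-8)');
    print "$decoded\n";

    # Show the Unicode code point of each decoded character.
    printf "U+%04X\n", ord($_) for split //, $decoded;
    ```

    The first printed line is aß?de?, so the Greek letters were already replaced by look-alikes or question marks before Perl saw them, and no decoding step can recover them.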

    There is a "Wide" interface for passing (almost) any Unicode code point to a program, but

    1. The Wide interface is not used when you type in a command at the prompt.
    2. Perl uses the ANSI interface to fetch the parameters, so even if you started Perl using the Wide interface, the parameters would get downgraded to ANSI when Perl fetches them.

    Sorry, but this is a "you can't" type of situation. You need a different approach. Diomidis Spinellis suggests changing your system's ANSI code page as follows in Win7:

    1. Control Panel
    2. Region and Language
    3. Administrative
    4. Language for non-Unicode programs
    5. Set the Current language for non-Unicode programs to the language associated with the specific characters (Greek in your case).

    At this point, you'd use the encoding of the ANSI code page associated with the newly selected language (cp1253 for Greek) instead of cp1252.

    use Encode qw( decode );
    @ARGV = map { decode('cp1253', $_) } @ARGV;
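
    Rather than hard-coding the code page, the script could ask Windows for the current ANSI code page. This is only a sketch: ansi_encoding_name is a hypothetical helper, and it assumes the installed Win32 module provides Win32::GetACP() and that Encode knows the resulting cpNNNN name. On other systems it falls back to the name passed in.

    ```perl
    use strict;
    use warnings;
    use Encode qw( decode );

    # Hypothetical helper: returns the Encode name of the ANSI code page.
    # Win32::GetACP() is only available on Windows, so elsewhere the
    # caller-supplied fallback is returned instead.
    sub ansi_encoding_name {
        my ($fallback) = @_;
        if ($^O eq 'MSWin32') {
            require Win32;
            return 'cp' . Win32::GetACP();
        }
        return $fallback;
    }

    # cp1253 is the Greek ANSI code page from the example above.
    my $encoding = ansi_encoding_name('cp1253');
    @ARGV = map { decode($encoding, $_) } @ARGV;
    ```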
    

    Note that using chcp to change the code page used within the console window does not affect the code page in which Perl receives its arguments, which is always an ANSI code page. See the examples below. (cp737 is the Greek OEM code page, and cp1253 is the Greek ANSI code page. You can find the encodings labeled as 37 and M7 in this document.)

    C:\>chcp 737
    Active code page: 737
    
    C:\>echo αβγδεζ | od -t x1
    0000000 98 99 9a 9b 9c 9d 20 0d 0a
    
    C:\>perl -e "print map sprintf('%x ', ord($_)), split(//, $ARGV[0])" αβγδεζ
    e1 e2 e3 e4 e5 e6
    
    C:\>chcp 1253
    Active code page: 1253
    
    C:\>echo αβγδεζ | od -t x1
    0000000 e1 e2 e3 e4 e5 e6 20 0d 0a
    
    C:\>perl -e "print map sprintf('%x ', ord($_)), split(//, $ARGV[0])" αβγδεζ
    e1 e2 e3 e4 e5 e6
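
    The byte values in those dumps can be cross-checked without a Windows console. The sketch below encodes the same six letters with Encode's cp737 and cp1253 tables; the letters are written as code points so the script does not depend on the encoding of the source file.

    ```perl
    use strict;
    use warnings;
    use Encode qw( encode );

    # The six Greek letters from the question, alpha through zeta.
    my $greek = "\x{3B1}\x{3B2}\x{3B3}\x{3B4}\x{3B5}\x{3B6}";

    for my $enc (qw( cp737 cp1253 )) {
        # Encode to bytes, then print each byte as two hex digits.
        my $bytes = encode($enc, $greek);
        my $hex   = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
        print "$enc: $hex\n";
    }
    ```

    The cp737 line matches what echo produced under chcp 737, while the cp1253 line matches what Perl received in both cases.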