perlunicode

Perl ord() and chr() working with Unicode


To my horror, I've just found out that chr doesn't work with Unicode, although it does something. The man page is anything but clear:

Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.

Indeed I can print a smiley using

perl -e 'print chr(0x263a)'

but things like chr(0x00C0) do not work. I see that my Perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.

I've tried funny things like use utf8 and use encoding 'utf8'. I haven't tried funny things like use v5.12 and use feature 'unicode_strings', as they don't work with my version. I was fooling around with Encode::decode, only to find out that I need no decoding as I have no byte array to decode. I've read much more documentation than ever before, and found quite a few interesting things but nothing helpful. It looks like some form of the Unicode Bug, but no usable solution is given. Moreover, I don't care about the whole string semantics; all I need is a trivial function.

So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?


The first answer I got explains nearly everything about I/O, but I still don't understand why

#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';

print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";

print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";

prints

ne1 - eq1
match1 - no_match2

It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).


Solution

  • First,

    perl -le'print chr(0x263A);'
    

    is buggy. Perl even tells you as much:

    Wide character in print at -e line 1.
    

    That doesn't qualify as "working". So while they differ in how they fail, neither of the following gives you what you want:

    perl -le'print chr(0x263A);'
    
    perl -le'print chr(0x00C0);'
    

    To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode the code points using UTF-8.

    $ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
    ☺
    
    $ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
    À
    

    Now on to the "why".

    File handles can only transmit bytes, so unless you tell it otherwise, Perl expects the strings you print to a handle to consist of bytes. That means the string you provide to print cannot contain anything but bytes, or in other words, it cannot contain characters over 255. The output is exactly what you provide:

    $ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
    0000000 00 65 c0 f0
    0000004
    

    This is useful. It is different than what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.

    By adding an :encoding layer, the handle now expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";
       print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
    ' | od -t x1
    0000000 00 65 c3 80 c3 b0 e2 98 ba
    0000011
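
To see the conversion the layer performs without involving a terminal, here is a small sketch that encodes the same text two ways: explicitly with Encode::encode, and through an :encoding(UTF-8) layer on an in-memory handle. The variable names and the hex_dump helper are made up for the illustration; both routes should give the same bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Text: a string whose characters are Unicode code points.
my $text = join '', map { chr($_) } 0x65, 0xC0, 0x263A;

# Explicit encoding: text in, UTF-8 bytes out.
my $bytes = encode('UTF-8', $text);

# The same conversion done by an :encoding layer, here on an
# in-memory handle so the bytes can be inspected afterwards.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open failed: $!";
print {$fh} $text;
close $fh or die "close failed: $!";

# Hypothetical helper: hex dump of a byte string, e.g. "65 c3 80".
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $string;
}

print hex_dump($bytes),  "\n";   # 65 c3 80 e2 98 ba
print hex_dump($buffer), "\n";   # 65 c3 80 e2 98 ba
```

The handle never sees the text itself; the layer has already turned it into bytes by the time anything is written.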
    

    You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.

    A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
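
A minimal sketch of that idea, using arbitrary numbers picked for the example: the same string holds what could be an IP address followed by a code point, and the basic string functions only count and move the elements around.

```perl
use strict;
use warnings;

# Build a string from numbers; chr does not care what they mean.
my $s = join '', map { chr($_) } 192, 168, 0, 1, 0x263A;

# Basic string functions count and move elements, nothing more.
my $forward  = join ' ', map { ord($_) } split //, $s;
my $backward = join ' ', map { ord($_) } split //, scalar reverse $s;

print length($s), "\n";   # 5
print "$forward\n";       # 192 168 0 1 9786
print "$backward\n";      # 9786 1 0 168 192
```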

    Here are a few examples of operators that do assign meaning to the strings they receive as operands:
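
As an illustrative sketch (the three operators below are assumptions chosen for the example): lc and /\w/ treat the characters as Unicode code points, while inet_ntoa from the core Socket module treats a four-character string as a packed IPv4 address. A character above 255 is used for the first two so that the Unicode Bug discussed later does not get in the way.

```perl
use strict;
use warnings;
use Socket qw(inet_ntoa);

# lc and \w interpret the characters as Unicode code points.
my $delta = chr(0x0394);                       # GREEK CAPITAL LETTER DELTA
my $lower = lc $delta;
printf "U+%04X\n", ord($lower);                # U+03B4
printf "%s\n", $delta =~ /\w/ ? 'word' : 'not word';   # word

# inet_ntoa interprets a four-character string as a packed IPv4 address.
my $packed = join '', map { chr($_) } 127, 0, 0, 1;
my $dotted = inet_ntoa($packed);
print "$dotted\n";                             # 127.0.0.1
```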


    So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?

    chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't tell Perl, Perl actually sees a two-character string on the RHS.
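
To make the two-character string visible, here is a sketch that spells out the bytes of a UTF-8 encoded 'À' with escapes instead of relying on how the source file is saved. Using utf8::decode here is only a stand-in for what use utf8 arranges for literals in the source.

```perl
use strict;
use warnings;

# What Perl sees for a UTF-8 encoded "À" literal without "use utf8":
# the two bytes C3 80, taken as two separate characters.
my $rhs = "\xC3\x80";

print length($rhs), "\n";                     # 2
print chr(0xC0) eq $rhs ? "eq\n" : "ne\n";    # ne

# Decoding those bytes as UTF-8 yields the single character U+00C0.
utf8::decode($rhs) or die "not valid UTF-8";

print length($rhs), "\n";                     # 1
print chr(0xC0) eq $rhs ? "eq\n" : "ne\n";    # eq
```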


    Regarding the question you've added:

    There are problems with the encoding pragma. I recommend against using it. Instead, use

    use open ':std', ':encoding(UTF-8)';
    

    That'll fix one of the problems. The other problem you are encountering is with

    chr(0x00C0) =~ /\w/
    

    It's a known bug that's intentionally left broken for backwards compatibility reasons. That is, unless you request a more recent version of the language as follows:

    use 5.014;    # use 5.012; *might* suffice.
    

    A workaround that works as far back as 5.8:

    my $x = chr(0x00C0);
    utf8::upgrade($x);    # changes the internal storage format, not the string's value
    $x =~ /\w/            # now matches
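
If all you want is the trivial function from the question, the workaround can be wrapped up as below. This is only a sketch: real_chr is the hypothetical name used in the question, not an existing function.

```perl
use strict;
use warnings;

# Hypothetical helper named after the real_chr in the question:
# chr plus utf8::upgrade, so that /\w/ and friends apply Unicode rules
# even on older Perls. The upgrade does not change the string's value.
sub real_chr {
    my ($code_point) = @_;
    my $char = chr($code_point);
    utf8::upgrade($char);
    return $char;
}

my $a_grave = real_chr(0xC0);
my $smiley  = real_chr(0x263A);

printf "%s\n", $a_grave =~ /\w/       ? 'match' : 'no_match';   # match
printf "%s\n", $a_grave eq chr(0xC0)  ? 'eq'    : 'ne';         # eq
printf "%s\n", $smiley eq chr(0x263A) ? 'eq'    : 'ne';         # eq
```

Printing the result still needs an :encoding layer on the handle, as described above.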