
IO::Handle to get and unget Unicode characters


I think I've run into a problem with Unicode and IO::Handle. It's very likely I'm doing something wrong. I want to get and unget individual Unicode characters (not bytes) from an IO::Handle, but I'm getting a surprising error.

#!/usr/local/bin/perl

use 5.016;
use utf8;
use strict;
use warnings;
use IO::File;    # load the class before calling IO::File->new
binmode(STDIN,  ':encoding(utf-8)');
binmode(STDOUT, ':encoding(utf-8)');
binmode(STDERR, ':encoding(utf-8)');

my $string = qq[a Å];
my $fh = IO::File->new();

$fh->open(\$string, '<:encoding(UTF-8)');

say $fh->getc(); # a
say $fh->getc(); # SPACE
say $fh->getc(); # Å LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5)
$fh->ungetc(ord("Å"));
say $fh->getc(); # should be A RING again.

The ungetc() call triggers the following messages, which are reported at the final say (line 21), not at the ungetc() line itself:

    Malformed UTF-8 character (unexpected end of string) in say at unicode.pl line 21.
    "\x{00c5}" does not map to utf8 at unicode.pl line 21.

But U+00C5 is the correct code point for the character, so it should map to the character.

I used a hex editor to make sure that the bytes for A-RING are correct for UTF-8.

This seems to be a problem for any two-byte character.

The final say outputs '\xC5' (literally four characters: backslash, x, C, 5).

And I've tested this by reading from files instead of scalar variables. The result is the same.

This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-2level

And the script is saved in UTF-8. That was the first thing I checked.
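
To put the '\xC5' from the messages in context, here is a minimal sketch that is not part of the original script. It uses only the built-in utf8::encode and utf8::decode functions; the helper names utf8_hex and is_valid_utf8 are made up for illustration. It shows that U+00C5 occupies two bytes in UTF-8, and that the single byte 0xC5 on its own is not well-formed UTF-8, which is at least consistent with a lone byte ending up in the input buffer.

```perl
use strict;
use warnings;

# Hypothetical helper: UTF-8 encoding of a character string as hex bytes.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = $chars;
    utf8::encode($bytes);    # in place: characters -> UTF-8 octets
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Hypothetical helper: 1 if the octet string is well-formed UTF-8, else 0.
sub is_valid_utf8 {
    my ($octets) = @_;
    my $copy = $octets;      # utf8::decode modifies its argument
    return utf8::decode($copy) ? 1 : 0;
}

print utf8_hex("\x{C5}"), "\n";                                  # C3 85
print is_valid_utf8("\xC3\x85") ? "ok" : "malformed", "\n";      # ok
print is_valid_utf8("\xC5")     ? "ok" : "malformed", "\n";      # malformed
```

0xC5 is the lead byte of a two-byte sequence, so a decoder that sees it without a continuation byte has to treat it as truncated input.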


Solution

  • I am pretty certain this proves there is a serious Unicode-processing bug going on, given that this output:

    perl5.16.0 ungettest
    ungettest 98896 @ Sun Jan  6 16:01:08 2013: sending normal line to kid
    ungettest 98896 @ Sun Jan  6 16:01:08 2013: await()ing kid
    ungettest 98897 @ Sun Jan  6 16:01:08 2013: ungetting litte z
    ungettest 98897 @ Sun Jan  6 16:01:08 2013: ungetting big sigma
    ungettest 98897 @ Sun Jan  6 16:01:08 2013: kid looping on parental input
    98897: Unexpected fatalized warning: utf8 "\xA3" does not map to Unicode at ungettest line 40, <STDIN> line 1.
     at ungettest line 10, <STDIN> line 1.
        main::__ANON__('utf8 "\xA3" does not map to Unicode at ungettest line 40, <ST...') called at ungettest line 40
    98896: parent pclose failed: 65280,  at ungettest line 28.
    Exit 255
    

    is produced by this program:

    #!/usr/bin/env perl
    
    use v5.16;
    use strict;
    use warnings;
    use open qw( :utf8    :std );
    
    use Carp;
    
    $SIG{__WARN__} = sub {  confess "$$: Unexpected fatalized warning: @_" };
    
    sub ungetchar($) {
        my $char = shift();
        confess "$$: expected single character pushback, not <$char>" if length($char) != 1;
        STDIN->ungetc(ord $char);
    }
    
    sub debug {
        my $now = localtime(time());
        print STDERR "$0 $$ \@ $now: @_\n";
    }
    
    if (open(STDOUT, "|-")                          // confess "$$: cannot fork: $!") {
        $| = 1;
        debug("sending normal line to kid");
        say "From \N{greek:alpha} to \N{greek:omega}.";
        debug("await()ing kid");
        close(STDOUT)                               || confess "$$: parent pclose failed: $?, $!";
        debug("child finished, parent exiting normally");
        exit(0);
    }
    
    debug("ungetting litte z");
    ungetchar("z")                                  || confess "$$: ASCII ungetchar failed: $!";
    
    debug("ungetting big sigma");
    ungetchar("\N{greek:Sigma}")                    || confess "$$: Unicode ungetchar failed: $!";
    
    debug("kid looping on parental input");
    while (<STDIN>) {
        chomp;
        debug("kid got $_");
    }
    close(STDIN)                                    || confess "$$: child pclose failed: $?, $!";
    debug("parent closed pipe, child exiting normally");
    exit 0;
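
The byte named in each report (\xC5 for U+00C5 in the question, \xA3 for U+03A3 here) suggests that the pushback happens at the byte level rather than the character level. Assuming that is the cause, one way to sidestep it is to keep pushed-back characters in Perl space and never call ungetc at all. The sketch below is only an illustration: the PushbackReader class and its next_char and push_back methods are hypothetical names, not part of IO::Handle, and this is not a fix for the underlying behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper class (not part of IO::Handle): wraps a filehandle and
# keeps pushed-back characters on a Perl-level stack instead of in PerlIO.
package PushbackReader;

sub new {
    my ($class, $fh) = @_;
    return bless { fh => $fh, stack => [] }, $class;
}

# Return the most recently pushed-back character, else read one from the handle.
sub next_char {
    my ($self) = @_;
    return pop @{ $self->{stack} } if @{ $self->{stack} };
    my $fh = $self->{fh};
    return CORE::getc($fh);    # undef at end of file
}

# Push back one character (a string of length 1, not an ordinal).
sub push_back {
    my ($self, $char) = @_;
    die "expected a single character"
        unless defined $char && length($char) == 1;
    push @{ $self->{stack} }, $char;
    return 1;
}

package main;

my $string = "a \xC3\x85";    # UTF-8 octets for "a", SPACE, U+00C5
open(my $fh, '<:encoding(UTF-8)', \$string) or die "cannot open: $!";
my $in = PushbackReader->new($fh);

my @seen;
push @seen, $in->next_char for 1 .. 3;    # "a", " ", U+00C5
$in->push_back($seen[2]);
my $again = $in->next_char;               # comes from the stack, not the handle
printf "U+%04X\n", ord $again;            # U+00C5
```

Because the stack holds decoded characters, the encoding layer is never asked to re-read them. The trade-off is that every read of that handle has to go through the wrapper; a plain <STDIN> or getc on the raw handle will not see pushed-back characters.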