perl unicode utf-8 tie

Why is my Perl program failing with Tie::File and Unicode/UTF-8 encoding?


I am working on a project which deals with data in foreign languages. My Perl scripts were running fine.

I then wanted to use Tie::File, since this is a neat concept (and saves time and coding).

It seems that Tie::File is failing under Unicode/UTF-8 (unless I am missing something).

Here is a program which demonstrates the problem (the data is a mix of English, Greek and Hebrew):

use strict;
use warnings;
use 5.014;
use Win32::Console;
use autodie;
use warnings qw< FATAL utf8 >;
use Carp;
use Carp::Always;
use utf8;
use feature        qw< unicode_strings>;
use charnames      qw< :full>;
use Tie::File;

my ($i);
my ( $FileName);
my (@Tied);
binmode STDOUT, ':unix:utf8';
binmode STDERR, ':unix:utf8';
binmode $DB::OUT, ':unix:utf8' if $DB::OUT; # for the debugger
Win32::Console::OutputCP(65001);         # Set the console code page to UTF8

$FileName = 'E:\\My Documents\\Technical\\Perl\\Eclipse workspace\\Work\\'.
        'Tie File test res.txt';
tie @Tied, 'Tie::File', $FileName, recsep => "\x0D\x0A", discipline => ':encoding(utf8)'
            or confess 'tie @Tied failed';
$i =0;
while (<DATA>) {
    chomp;
    $Tied[$i] = $_;
    ++$i;
} # end while (<DATA>) 
$i =0;
foreach (@Tied) {
    say "$i $Tied[$i]";
    ++$i;
} # end foreach (@Tied)
untie @Tied;
__DATA__
τι κάνετε;
πάρτε το ή αφήστε το
שלום חברים
abc לא כןכן efg
מתי ולאן This is it
מעכשיו לעכשיו 
Σήμερα είναι Τρίτη
Θέλω να φάω
τι κάνετε;
שורה מס' 5

This produces a huge cascade of warnings. Here are some of them:

utf8 "\xCE" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xCF" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31
utf8 "\xD7" does not map to Unicode at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 917
        Tie::File::_read_record('Tie::File=HASH(0x24cb72c)') called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 175
        Tie::File::_fetch('Tie::File=HASH(0x24cb72c)', 0) called at F:/Win7programs/Dwimperl/perl/lib/Tie/File.pm line 210
        Tie::File::STORE('Tie::File=HASH(0x24cb72c)', 0, 'τι κάνετε;') called at tie file test.pl line 31

Then it prints this on STDOUT:

0 τι κάνετε;
1 πάρτε το ή αφήστε το
2 שלום חברים
3 abc לא כןכן efg
4 מתי ולאן This is it
5 מעכשיו לעכשיו
6 Σήμερα είναι Τρίτη
7 Θέλω να φάω
8 τι κάνετε;
9 שורה מס' 5
10
11
12
13
14 \xA4\xΘέλω\xA8\x

15
16
17
18

19

Note that the first 10 lines are OK, but lines 10 through 19 came from nowhere!? In addition, the output of the tied file contains corrupted data:

 τι κάνϏN͏Ŏՠτήστε של חברءbc לؗܗࠗܗߠeמתולאן This is מעיו לעכ؎Ďώݎ֏ναι ΤρΘέώގѠφϏŎ٠κτε;שרה מס'



\xA4\xΘέλω\xA8\x

Something is very wrong here. Either I am missing something, or Tie::File can't cope with Unicode/UTF-8. I am running Strawberry Perl 5.14 on a Windows 7 system.

Many TIA - Helen

Note: posted on http://perlmonks.org/?node_id=1002104, too


Solution

  • The suggestion I would make depends very much on the actual problem you're trying to solve. Looking at this question in isolation, I would drop most of the encoding/decoding 'magic' and simply work with the raw bytes, since the script doesn't need to know anything about the characters themselves for this task. The code below produces the expected result for the input and output you described.

    use v5.014;
    use warnings;
    use autodie;
    
    use Carp::Always;
    use Tie::File;
    
    my $file_in = 'test_in.txt';
    my $file_out = 'test_tie.txt';
    
    # Guarded because autodie makes unlink die when there is nothing to delete.
    unlink $file_out if -e $file_out;
    
    tie my @tied, 'Tie::File', $file_out, recsep => "\x0D\x0A" or die 'tie failed';
    
    open my $fh, '<', $file_in;
    while (my $line = <$fh>) {
        chomp $line;
        push @tied, $line;
    }
    close $fh;
    
    my $i = 0;
    say $i++ . ' ' . $_ foreach @tied;
    
    untie @tied;
    

    However, you probably do want to process that text somewhere in the middle, in which case you need decoded characters. As I see it, there are two options:

    1. Encode manually before handing off to the tied array
    2. Figure out what the issue is with Tie::File

    Option 2 is probably non-trivial: a quick scan of the Tie::File source suggests that it assumes it will always be given bytes. The only part you can seemingly affect is the binmode call at https://metacpan.org/source/TODDR/Tie-File-0.98/lib/Tie/File.pm#L111, which you are already doing.

    Tie::File makes a lot of seek calls, and perldoc has this to say about seek ( http://perldoc.perl.org/functions/seek.html ):

    Note the in bytes: even if the filehandle has been set to operate on characters (for example by using the :encoding(utf8) open layer), tell() will return byte offsets, not character offsets (because implementing that would render seek() and tell() rather slow).

    So it appears that Tie::File uses character lengths to compute the byte offsets of its records. Because each Greek or Hebrew character takes more than one byte in UTF-8, a seek can then land in the middle of a multi-byte sequence. This seems a likely cause of your errors.
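
    To make the mismatch concrete, here is a minimal sketch that does not touch Tie::File at all. It builds two made-up records (written with escapes, so they are my own illustration rather than your data) and compares the offset of the second record counted in characters with the same offset counted in bytes. Treating the character count as a byte offset lands inside a multi-byte sequence, and strict decoding from there fails in much the same way as your warnings:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode decode);

    # Two hypothetical records, written with escapes so the example does
    # not depend on the encoding of this source file.
    my $first  = "\N{U+03C4}\N{U+03B9}";   # Greek "tau iota"
    my $second = "\N{U+03BA}";             # Greek "kappa"
    my $recsep = "\n";

    # What would actually be stored on disk.
    my $encoded = encode('UTF-8', $first . $recsep . $second);

    # Offset of the second record, measured two ways.
    my $char_offset = length($first . $recsep);                   # characters
    my $byte_offset = length(encode('UTF-8', $first . $recsep));  # bytes

    print "character offset: $char_offset, byte offset: $byte_offset\n";

    # Using the character count as a byte offset lands on the second byte
    # of "iota", so strict decoding of the remainder fails.
    my $tail = substr $encoded, $char_offset;
    my $ok   = eval {
        decode('UTF-8', $tail, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    print "decoding from the character offset ",
        ($ok ? "succeeded" : "failed"), "\n";

    # The true byte offset starts on a character boundary and decodes cleanly.
    my $rest = substr $encoded, $byte_offset;
    my $good = decode('UTF-8', $rest, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    ```

    This only illustrates the arithmetic; I have not traced exactly where Tie::File computes its offsets.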

    In general, I stay away from binmode when relying on an external module to read from or write to a file handle. In this case I would write a simple sub that calls Encode::encode('UTF-8', ...) on the data before pushing it onto @tied.

    The exception is when the module's documentation clearly states how it behaves with decoded data, or when the source is simple enough for me to verify that behaviour myself.
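
    For option 1, a rough sketch of what I have in mind follows. The helper names to_bytes and to_chars, the sample lines and the temporary file are all my own inventions for illustration. The point is that no discipline is passed to tie, so Tie::File only ever sees UTF-8 bytes, and the script encodes on the way in and decodes on the way out:

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode decode);
    use File::Temp qw(tempfile);
    use Tie::File;

    # Hypothetical helpers: characters <-> UTF-8 bytes.
    sub to_bytes { return encode('UTF-8', $_[0]) }
    sub to_chars { return decode('UTF-8', $_[0]) }

    # A throwaway file so the sketch is self-contained.
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # No 'discipline' option: Tie::File works on raw bytes only.
    tie my @tied, 'Tie::File', $path, recsep => "\x0D\x0A"
        or die "tie failed: $!";

    # Decoded sample text (Greek, Hebrew, ASCII), written with escapes.
    my @lines = (
        "\N{U+03C4}\N{U+03B9}",
        "\N{U+05E9}\N{U+05DC}\N{U+05D5}\N{U+05DD}",
        "abc",
    );

    # Encode before handing the records to the tied array ...
    push @tied, map { to_bytes($_) } @lines;

    # ... and decode again when reading them back.
    my @round_trip = map { to_chars($_) } @tied;

    untie @tied;
    ```

    One thing to keep in mind with this approach: anything you read straight from @tied is bytes, so length and regex matches on those values count bytes until you decode them.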