perl utf-8 codepages cp1251

Pattern code to guess a text file's codepage in Perl?


Sorry for the newbie question, but I can't get the following script to work. It's a useful piece of code, but I couldn't find a similar working example on the net.

use Encode;
use Encode::Guess;

open (my $fhr, "<", "$folder\\$_")
  or die "Could not open file '$folder\\$_' $!";
my $data = do { local $/; <$fhr> };
close $fhr;

# It is either cp1251 or utf-8.
my  $enc = guess_encoding ($data, qw/cp1251/ );
ref($enc) or die "Can't guess: ".$enc->name();
print "Encode = ".$enc->name()."\n";

my $decoded = decode ($enc, $data);

The console output follows:

utf8 "\xCE" does not map to Unicode at double_fin.pl line 167, <$fhr> chunk 1.
Encode = utf8
Cannot decode string with wide characters at C:/Dwimperl/perl/lib/Encode.pm line 176.

What am I doing wrong? Thank you in advance.


Solution

  • The first message

    utf8 "\xCE" does not map to Unicode at double_fin.pl line 167, <$fhr> chunk 1
    

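    comes when perl tries to decode the content of the file read through $fhr as UTF-8 (double_fin.pl line 167 is the place in the script where the read happens). But that file contains bytes that are not valid UTF-8, so you need to read the file using the PerlIO layer :raw instead of the :encoding(utf8) layer. The open call shown in the question names no layer, so :encoding(utf8) is presumably being applied as a default set elsewhere in the script, for example by the open pragma.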

    See also http://perldoc.perl.org/PerlIO.html.
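
A minimal sketch of the :raw approach described above, assuming (as the question states) that the file is either UTF-8 or cp1251. The helper name slurp_decoded is hypothetical and not part of Encode. The sketch does not call guess_encoding: it tries a strict UTF-8 decode first and falls back to cp1251, which is a different technique from the one in the question. It is used here because guess_encoding may report an ambiguous result when more than one suspect can decode the bytes.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper (not from the original post): read a file as raw
# bytes and decode it as UTF-8 if possible, otherwise as cp1251.
# Returns the decoded text and the name of the encoding that was used.
sub slurp_decoded {
    my ($path) = @_;

    # :raw bypasses any default decoding layer, so $data holds plain octets.
    open(my $fh, '<:raw', $path)
        or die "Could not open file '$path': $!";
    my $data = do { local $/; <$fh> };
    close $fh;

    # FB_CROAK makes decode die on invalid UTF-8; LEAVE_SRC keeps $data intact
    # so the same bytes can be decoded again in the fallback.
    my $text = eval {
        decode('UTF-8', $data, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ($text, 'UTF-8') if defined $text;

    # Assumption: anything that is not valid UTF-8 is cp1251.
    return (decode('cp1251', $data), 'cp1251');
}

# Usage (the path is a placeholder):
#   my ($text, $enc) = slurp_decoded("$folder\\$file");
#   print "Encode = $enc\n";
```

Because the bytes are read without any decoding layer, decode receives octets rather than an already decoded string, which is the condition that the "Cannot decode string with wide characters" message complains about.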