I'm trying to understand the difference between UTF-8, utf-8, and UTF8 as described at the very bottom of Encode.pm.
#!/usr/bin/perl

use strict;
use warnings;
use Encode;

printf "Perl: %s, Encode.pm: %s\n", $^V, $Encode::VERSION;
# https://metacpan.org/pod/Encode#UTF-8-vs.-utf8-vs.-UTF8
encode("utf8", "\x{FFFF_FFFF}", 1); # okay
encode("UTF-8", "\x{FFFF_FFFF}", 1); # croaks
1;
Here's the output:
Code point 0xFFFFFFFF is not Unicode, requires a Perl extension, and so is not portable at test.pl line 9.
Code point 0xFFFFFFFF is not Unicode, requires a Perl extension, and so is not portable at test.pl line 10.
Perl: v5.40.2, Encode.pm: 3.21
Modification of a read-only value attempted at test.pl line 9.
It looks like both lines generate the exact same warning, and somehow the third argument, CHECK, is being modified? Could someone please explain what's going on with this code?
UTF-8 and utf-8 both refer to the standard encoding. (Character encoding names are usually case-insensitive.)
utf8 is a Perl-specific extension of UTF-8. While UTF-8 can only encode valid Unicode code points (nothing above U+10FFFF), utf8 can encode any 72-bit number (though Perl itself only supports 32-bit or 64-bit numbers, depending on how it was built).
$ perl -e'
use Encode qw( decode encode );
my $enc = shift;
my $c = chr( 0xFFFF_FFFF );
$c = decode( $enc, encode( $enc, $c ) );
printf "%vX\n", $c;
' UTF-8
FFFD
$ perl -e'
use Encode qw( decode encode );
my $enc = shift;
my $c = chr( 0xFFFF_FFFF );
$c = decode( $enc, encode( $enc, $c ) );
printf "%vX\n", $c;
' utf8
FFFFFFFF
In the first run, U+FFFD REPLACEMENT CHARACTER is substituted for the character that UTF-8 can't encode.
As for the error: because LEAVE_SRC wasn't included in CHECK, encode attempts to modify its second argument (the string, not CHECK) in place. That fails since you passed a constant.
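To make that concrete, here is a minimal sketch of the two usual ways around the read-only error: pass a variable instead of a literal, or OR LEAVE_SRC into CHECK. The variable names are just for illustration, and the `no warnings` line is only intended to quiet the "not portable" warnings for this deliberately out-of-range code point.

```perl
use strict;
use warnings;
no warnings qw( non_unicode portable );   # intended to quiet the "not portable" warnings
use Encode qw( encode FB_CROAK LEAVE_SRC );

# Option 1: pass a variable. With CHECK set and no LEAVE_SRC,
# encode() overwrites its second argument in place, so it must be writable.
my $src   = chr( 0xFFFF_FFFF );
my $bytes = encode( "utf8", $src, FB_CROAK );
printf "utf8: got %d bytes; source now has length %d\n",
    length( $bytes ), length( $src );

# Option 2: add LEAVE_SRC, so encode() leaves the source alone
# and a literal (read-only) string is acceptable.
my $lax_ok = eval { encode( "utf8", "\x{FFFF_FFFF}", FB_CROAK | LEAVE_SRC ); 1 };
print "utf8 with LEAVE_SRC on a literal: ", ( $lax_ok ? "ok" : "died: $@" ), "\n";

# With LEAVE_SRC in place, the remaining difference is the one the docs
# describe: strict UTF-8 rejects this code point, lax utf8 accepts it.
my $strict_ok = eval { encode( "UTF-8", "\x{FFFF_FFFF}", FB_CROAK | LEAVE_SRC ); 1 };
print "UTF-8 with LEAVE_SRC on a literal: ", ( $strict_ok ? "ok" : "croaked" ), "\n";
```

With option 1, expect $src to be left empty afterwards, since encode keeps only the unprocessed part of the source in it. If you still need the original string, use option 2 or encode a copy.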