
Encode - String bytes length


I have a file, file1.pl:

use strict;
use warnings;
use Encode;

my @flist = `svn diff --summarize ...`;

foreach my $file (@flist) {
  my $foo = "$one/$file";
  use bytes;
  print(bytes::length($one)."\n");
  print(bytes::length($file)."\n");
  print(bytes::length($foo)."\n");
}
# 76
# 31
# 108

I also have file2.pl with the same main logic, but there the output is:

# 76
# 31
# 110 <-- ?

Both files have the same encoding (ISO-8859-1). To get the same result as in file1.pl, I have to use

my $foo = "$one/".decode('UTF-8', $file);

in file2.pl. What could be the reason for that difference, or for needing decode('UTF-8', $file) in file2.pl? It seems to be related to "What if I don't decode?", but in what way, and why only in file2.pl? Thanks.

Perl v5.10.1


Solution

  • Don't use bytes.

    The bytes documentation says: "Use of this module for anything other than debugging purposes is strongly discouraged."

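    bytes::length returns the number of bytes Perl uses to store a string internally. That number depends on which storage format Perl happens to have chosen, not just on the string's content, so it's useless for anything but debugging.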


    Regarding "What could be the reason for that difference":

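    $one and $file contained strings stored using different internal storage formats. One needed to be converted for a concatenation to occur. That conversion re-stores every character in the range 0x80-0xFF as two bytes instead of one, so the two extra bytes (110 instead of 76 + 1 + 31 = 108) are consistent with two such characters having been converted. The characters of the string are unchanged; only their storage is.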

    use strict;
    use warnings;
    use feature qw( say );
    use bytes qw( );
    use Encode qw( encode );
    
    sub dump_lengths {
       my $s = shift;
       say
          join " ",
             length( $s ),
             length( encode( "UTF-8", $s ) ),
             bytes::length( $s );
    }
                             # +------ Length of string
    my $x = chr( 0xE9 );     # | +---- Length of its UTF-8 encoding
    my $y = chr( 0x2660 );   # | | +-- Length of internal storage
                             # | | |
    dump_lengths( $x );      # 1 2 1
    dump_lengths( $y );      # 1 3 3
    
    my $z = $x . $y;         # $x is converted to the wider format here: its 1 byte becomes 2
    
    dump_lengths( $z );      # 2 5 5
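
The question's numbers can be reproduced with a minimal sketch. The question doesn't show how $one was built, so this assumes that in file2.pl $one was stored in the upgraded format (forced here with utf8::upgrade) while $file held raw UTF-8 bytes from the command's output. The path and file name are made up for illustration.

```perl
use strict;
use warnings;
use feature qw( say );
use bytes qw( );                  # debugging only, for bytes::length
use Encode qw( decode encode );

# Hypothetical stand-ins for the question's variables.
my $one = "/repo/trunk";          # 11 ASCII characters
utf8::upgrade( $one );            # assumption: $one ended up in the upgraded format
my $file = "caf\xC3\xA9.txt";     # assumption: 9 raw bytes, UTF-8 for "caf" + U+00E9 + ".txt"

# Sum of the parts, as printed by the first two lines in the question.
my $parts_len = bytes::length( $one ) + 1 + bytes::length( $file );

# Concatenating the undecoded bytes: 0xC3 and 0xA9 are treated as two
# separate characters, and each one takes two bytes once upgraded.
my $raw = "$one/$file";

# Decoding first turns those two bytes into the single character U+00E9.
my $decoded = "$one/" . decode( "UTF-8", $file );

my $raw_len     = bytes::length( $raw );
my $decoded_len = bytes::length( $decoded );
my $utf8_len    = length( encode( "UTF-8", $decoded ) );

say $parts_len;      # 21
say $raw_len;        # 23 (two bytes more than the parts)
say $decoded_len;    # 21
say $utf8_len;       # 21
```

If the goal is a byte count, length( encode( "UTF-8", $s ) ) on decoded text is the number to rely on: it measures a specific encoding and doesn't depend on how Perl stores the string.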