perlunicode

sprintf and multi-column Unicode characters


I guess I went down a Unicode rabbit hole, and I need help getting out. I was testing how to line up columns with sprintf and Unicode strings, so I was kind of expecting this:

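use strict;
use warnings;
use utf8;
use Encode qw/decode encode/;

$\ = "\n"; $, = "\t";

# Read the raw (undecoded) lines so the byte length can be reported too.
open my $uni, "<", "unicode_strings.txt"
    or die "Cannot open unicode_strings.txt: $!";
my @in = map { chomp; $_ } <$uni>;

my @l = map {
    # Decode unless the string already holds characters.
    my $decoded = utf8::is_utf8($_) ? $_ : decode("UTF-8", $_);
    [
     sprintf ("%-32s", $decoded),          # pads by character count, not columns
     sprintf ("%02i", length($decoded)),   # length in characters
     sprintf ("%02i", length($_)),         # length in bytes
    ]
} @in;

print encode "UTF-8", $_ for map { join " | ", $_->@* } @l;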

with this input:

normal
übung
schön
fähig
niño
crème brûlée
smörgåsbord
добрый день
😊
🌍
你好
こんにちは
안녕하세요
مرحبا
שָׁלוֹם
ज़िंदगी

to line up things neatly.

Instead, things work OK up to and including добрый день and then get misaligned like this (I only want the columns to line up; we'll deal with left-to-right and right-to-left later):

normal                           | 06 | 06
übung                            | 05 | 06
schön                            | 05 | 06
fähig                            | 05 | 06
niño                             | 04 | 05
crème brûlée                     | 12 | 15
smörgåsbord                      | 11 | 13
добрый день                      | 11 | 21
😊                                | 01 | 04
🌍                                | 01 | 04
你好                               | 02 | 06
こんにちは                            | 05 | 15
안녕하세요                            | 05 | 15
مرحبا                            | 05 | 10
שָׁלוֹם                          | 07 | 14
ज़िंदगी                          | 07 | 21

I've been poking around at Unicode::GCString but adjusting for columns and so on doesn't seem to help much.

Any ideas?


Solution

  • Text::CharWidth's mbswidth provides the number of columns a string should occupy.

    Should. The quality of the results is going to come down to the font and renderer.

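    use Text::CharWidth qw( mbswidth );

    # Decode STDIN and encode STDOUT/STDERR according to the locale.
    use open ':std', ':encoding(locale)';
    
    while ( <> ) {
       chomp;
       # Pad to 15 columns using the display width, not the character count.
       printf "<%s%s>\n", $_, " " x ( 15 - mbswidth( $_ ) );
    }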
    

    My console:

    [Screenshot: rendered in console]

    Note that this appears to be spot on except where random white space appears inside the text. The length is off by exactly the size of that whitespace. So mbswidth is definitely counting correctly, but there's some kind of display issue.

    My browser displaying a code block on StackOverflow:

    [Screenshot: rendered in browser]

    And here we see what happens when the font being used doesn't have the glyphs you are trying to print: glyphs can be borrowed from other fonts. This leads to discrepancies, and not always by a whole number of columns.

    [Screenshot: highlight of discrepancies]

    There can't possibly exist something that returns the correct number of columns when some characters occupy slightly more or slightly less than a column. mbswidth is the best you're going to get if you're working with columns.

    Your browser as displayed by StackOverflow:

    <normal         >
    <übung          >
    <schön          >
    <fähig          >
    <niño           >
    <crème brûlée   >
    <smörgåsbord    >
    <добрый день    >
    <😊             >
    <🌍             >
    <你好           >
    <こんにちは     >
    <안녕하세요     >
    <مرحبا          >
    <שָׁלוֹם           >
    <ज़िंदगी          >
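To show what "number of columns" means in practice, here is a minimal sketch that uses only core Perl. It is not how mbswidth works internally, just a rough approximation built on Unicode properties: combining marks and format characters are assumed to take 0 columns, East Asian Wide and Fullwidth characters 2, and everything else 1. The helper names approx_width and pad_right are made up for this example, and the same caveats about fonts and renderers apply.

```perl
use strict;
use warnings;
use utf8;                               # the sample strings below are UTF-8 literals
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: rough column count of a decoded (character) string.
sub approx_width {
    my ($str) = @_;
    my $width = 0;
    for my $ch ( split //, $str ) {
        # Assumption: nonspacing/enclosing marks and format characters take no column.
        next if $ch =~ /[\p{Mn}\p{Me}\p{Cf}]/;
        # Assumption: East Asian Wide and Fullwidth characters take two columns.
        $width += $ch =~ /[\p{East_Asian_Width=Wide}\p{East_Asian_Width=Fullwidth}]/ ? 2 : 1;
    }
    return $width;
}

# Hypothetical helper: left-justify $str in $cols columns (never truncates).
sub pad_right {
    my ( $str, $cols ) = @_;
    my $fill = $cols - approx_width($str);
    return $str . ( ' ' x ( $fill > 0 ? $fill : 0 ) );
}

printf "<%s>\n", pad_right( $_, 15 ) for 'normal', 'übung', '你好', 'こんにちは';
```

With this, 你好 is counted as 4 columns and gets 11 spaces of padding, where sprintf "%-15s" would count 2 characters and add 13. For real use, mbswidth remains the better choice, since it relies on the system's own width tables rather than this simplified rule.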