This question is restricted to strings devoid of whitespace, written intentionally for a human to read them.
I don't care about NUL or other characters that one would not find in a piece of text set down for human consumption.
Also, I don't care about "pathological" cases such as
#!/usr/bin/env perl
use strict; use warnings;
use feature 'say';
say 'dog', "\r", 'rat';
say 'a', "\b", 'z';
This question would be useful, for instance, for generating nicely centered lines of text, when the text is not all ASCII.
In the Perl script below, we look first at strings that take up 1 column, then 2 columns, 3 columns, etc.
As we see from running this code, neither the number of bytes nor the length of an array created by splitting a string at \B reliably tells us how many columns a string or a character will take up when printed. Is there a way to get this number?
#!/usr/bin/env perl
use strict; use warnings;
use feature 'say';
while (<DATA>)
{
    say '------------------------------------------';
    print;
    $_ =~ s/\s//g;
    my @array = split /\B/, $_;
    say length $_, ' bytes, ', scalar @array, ' components';
}
__DATA__
a
é
ø
ü
α
ά
∩
⊃
≈
≠
好
üb
üü
dog
Voß
café
Schwiizertüütsch
The number of columns used by a terminal to print text is directly determined from the number of "characters" printed, where each may take 0, 1, or 2 columns to print.
These "characters" are the logical Unicode characters, extended grapheme clusters. Each may be a sequence of code points, often a base character plus its combining diacritical marks (accents), or a single code point; either way it represents one character of the particular writing system or language.
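To make the distinction between code points and logical characters concrete, here is a minimal sketch. The helper names count_codepoints and count_graphemes are made up for illustration, and the strings are written with \x{...} escapes so the snippet does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helpers, named here only for illustration.
sub count_codepoints { my ($s) = @_; return length $s }

sub count_graphemes {
    my ($s) = @_;
    my $n = () = $s =~ /\X/g;   # \X matches one extended grapheme cluster
    return $n;
}

my %samples = (
    'precomposed e-acute (U+00E9)'    => "\x{e9}",
    'e + combining acute (U+0301)'    => "e\x{301}",
    'plain ASCII "dog"'               => "dog",
);

for my $label (sort keys %samples) {
    printf "%-32s %d code point(s), %d grapheme cluster(s)\n",
        $label,
        count_codepoints($samples{$label}),
        count_graphemes($samples{$label});
}
```

Both spellings of the accented letter look the same on screen and count as one grapheme cluster, even though one of them consists of two code points.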
What is needed then is to break the input into characters in a way that respects Unicode and find out how many columns each needs. (Well, or use a library that does that.)
One way to look into the width is to test with a regex for the East_Asian_Width property, with \p{East_Asian_Width=Wide} or \p{EA=W}. See perluniprops (+ perluniintro, perlunicode).
Or, go to the core module Unicode::UCD, which interfaces the Unicode Character Database and has all properties.
The values can be: Neutral (not East Asian), Wide, Ambiguous, Narrow, Fullwidth, and Halfwidth. These always resolve into one of two widths, narrow or wide, depending on context. See all the detail in the Unicode Standard Annex for East Asian Width, UAX #11, or see the list in the perluniprops linked above, which also shows how often they are found; the first three listed are represented incomparably more often than the rest.
The odd man out in that list is Ambiguous, which can be either wide or narrow depending on the context of its use (whether it is East Asian or not), and that involves all kinds of detail; see the link. Given the seeming need in this question, I'll leave that out for now and treat it as narrow. Then the only common property value that requires 2 columns is Wide (Fullwidth also does, but it is rare).
An example
use warnings;
use strict;
use feature 'say';
use List::Util qw(sum);
use utf8;
use open qw(:std :encoding(UTF-8));
my @w = qw(a é ø ü α ά ∩ ⊃ ≈ ≠ 好 üb üü dog Voß café Schwiizertüütsch);
foreach my $word (@w) {
    my $cols = sum map { /\p{EA=W}/ ? 2 : 1 } split '', $word;
    say "$word needs $cols";
}
If we were to use the character database then we'd also need the character's codepoint, for example
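use Unicode::UCD qw(charprop);        # charprop is not exported by default
for my $ucp (unpack 'W*', $word) {    # 'W*' yields every codepoint, not only the first
    my $eaw = charprop($ucp, "East_Asian_Width");
    say $eaw;
}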
This produces the names listed above.
We also need to enable Unicode support for the program (which is where the sample program from the question fails): The utf8 pragma is there since the source file itself has Unicode characters in it while the open pragma takes care of the standard streams.
Recall that all this leaves out the Ambiguous ones, i.e. we take them all to be narrow, which in general isn't correct.
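As a rough illustration of what handling the context by hand could look like, here is a sketch that walks a string by grapheme cluster and lets the caller say whether Ambiguous characters count as wide. The function names and the $ambiguous_is_wide flag are assumptions made up for this example; the width of a cluster is taken from its first code point only, which is a simplification and not the full UAX #11 treatment.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: columns for one grapheme cluster, judged by its
# first code point only (a simplification).
sub cluster_columns {
    my ($cluster, $ambiguous_is_wide) = @_;
    my $base = substr $cluster, 0, 1;
    return 0 if $base =~ /[\p{Mn}\p{Me}\p{Cf}]/;      # stray marks, format chars
    return 2 if $base =~ /[\p{EA=W}\p{EA=F}]/;        # Wide or Fullwidth
    return 2 if $ambiguous_is_wide && $base =~ /\p{EA=A}/;
    return 1;                                         # everything else: narrow
}

# Hypothetical helper: total columns for a decoded character string.
# $ambiguous_is_wide is an assumed flag for East Asian contexts.
sub display_columns {
    my ($text, $ambiguous_is_wide) = @_;
    my $total = 0;
    $total += cluster_columns($_, $ambiguous_is_wide) for $text =~ /\X/g;
    return $total;
}

for my $word ("dog", "caf\x{e9}", "\x{3b1}", "\x{597d}") {
    printf "%s: %d (narrow context), %d (East Asian context)\n",
        $word, display_columns($word), display_columns($word, 1);
}
```

Whether the flag should be set depends on the terminal and its locale, which this sketch does not try to detect.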
The easiest way to improve on this is to use a library, Unicode::GCString, with which things become almost trivial:
use Unicode::GCString;
foreach my $word (@w) {
    say "$word needs ",
        Unicode::GCString->new($word)->columns, " columns";
}
While this library is clearly put together thoughtfully and is reputable, it does come with a caveat. As of its latest meaningful update, a number of years ago, the library uses Unicode Standard 8.0.0 (as stated in Unicode::LineBreak, which it uses), which is badly out of date and may cause errors (see an example here).
That is a nit in comparison with the fact that the manual approach doesn't even address the context of Ambiguous width. But an important issue in the context of this question is that this is an external module, which comes with a need for a C library (sombok) and appears to not be updated any more.
Thanks to tchrist for bringing some of this up and thus triggering a far more detailed discussion.
If the intended use here does not involve Wide characters then it's pretty simple
while (<DATA>)
{
    s/\s//g;
    # Either of
    my @chars = split '';
    my @egc = /(\X)/g;
    say "$_\t", 0+@chars, " chars (split), ", 0+@egc, " chars (regex, \\X)";
}
(The Chinese character in the provided input then won't be correct but presumably that is actually not needed in this use case.)
The \X is one way to match a logical character; also see \b{gcb} on the same page. The capture (\X) isn't needed here since we want all that's matched, so my @egc = /\X/g; is fine. But it doesn't hurt, and if there is more in the pattern one may need it, so I put () in.
Please excuse my manners with 0+@ary for array size, as I'm trying to fit a line of code in the display width for easier reading; by all means one should use scalar @ary for this.
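For the centering use case mentioned in the question, the \X count can be plugged straight into the padding arithmetic. This is only a sketch under the same assumption as above, that no Wide characters are involved so every grapheme cluster takes one column; center_narrow is a made-up name.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: centre $text in a field of $width columns, assuming
# each grapheme cluster occupies exactly one column (no Wide characters).
sub center_narrow {
    my ($text, $width) = @_;
    my $cols = () = $text =~ /\X/g;            # number of grapheme clusters
    return $text if $cols >= $width;           # too long: leave as is
    my $left  = int( ($width - $cols) / 2 );   # any odd leftover goes right
    my $right = $width - $cols - $left;
    return (' ' x $left) . $text . (' ' x $right);
}

# Strings use \x{...} escapes so the sketch does not need 'use utf8'.
for my $word ("dog", "caf\x{e9}", "Schwiizert\x{fc}\x{fc}tsch") {
    print '[', center_narrow($word, 20), "]\n";
}
```

Padding with sprintf's %-20s or similar would count code points rather than clusters, which is why the padding is built by hand here.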
With the addition of the pragmas above the code in the question works well, and I find a statement from the documentation for length instructive:
Returns the length in characters of the value of EXPR.
...
Like all Perl character operations, length normally deals in logical characters, not physical bytes.
(original emphasis)
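A small sketch of what that means in practice, using the core Encode module: the same text gives a different length depending on whether the string holds undecoded UTF-8 bytes (as the question's program does without the pragmas) or decoded characters.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The UTF-8 bytes of "café", as they would arrive from an undecoded handle.
my $bytes = "caf\xC3\xA9";

# The same text as a decoded character string.
my $chars = decode('UTF-8', $bytes);

printf "undecoded:  length %d\n", length $bytes;                    # bytes
printf "decoded:    length %d\n", length $chars;                    # characters
printf "re-encoded: length %d\n", length encode('UTF-8', $chars);   # bytes again
```

The decoded string reports 4 while the byte strings report 5, since é takes two bytes in UTF-8.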
Thanks to Thomas Dickey for calling out the omission of a character width discussion in the original post. Thanks to tchrist for comments.
Note added: See also sprintf and multi-column Unicode characters