perl unicode utf-8 translation widechar

Wide character in print for some Farsi text, but not others


I'm using Google Translate to convert some error codes into Farsi with Perl. I've seen the same issue in other languages, but for this discussion I'll stick to Farsi as the single example:

The translated text of "Geometry data card error" works fine (Example 1) but translating "Appending a default 111 card" (Example 2) gives the "Wide character" error.

Both examples can be run from the terminal; they are just print statements.

I've tried the usual things like these, but to no avail:

use utf8;
use open ':std', ':encoding(UTF-8)';
binmode STDOUT, ':encoding(UTF-8)';

Example 1: This works

perl -Mutf8 -le 'print "\x{d8}\x{ae}\x{d8}\x{b7}\x{d8}\x{a7}\x{db}\x{8c} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d8}\x{af}\x{d8}\x{a7}\x{d8}\x{af}\x{d9}\x{87} \x{d9}\x{87}\x{d9}\x{86}\x{d8}\x{af}\x{d8}\x{b3}\x{db}\x{8c}"'
خطای کارت داده هندسی

Example 2: This produces Wide char warnings and prints noise

perl -Mutf8 -le 'print "\x{d8}\x{a7}\x{d9}\x{81}\x{d8}\x{b2}\x{d9}\x{88}\x{d8}\x{af}\x{d9}\x{86} \x{db}\x{8c}\x{da}\x{a9} \x{da}\x{a9}\x{d8}\x{a7}\x{d8}\x{b1}\x{d8}\x{aa} \x{d9}\x{be}\x{db}\x{8c}\x{d8}\x{b4}\x{200c}\x{d9}\x{81}\x{d8}\x{b1}\x{d8}\x{b6} 111"'
Wide character in print at -e line 1.
# <terminal noise, not Farsi text>

Using Curl

If I do the same request with curl I get this:

curl 'https://translate.googleapis.com/translate_a/single?client=gtx&sl=auto&tl=fa&hl=fa&dt=t&ie=UTF-8&oe=UTF-8&otf=1&ssel=0&tsel=0&tk=xxxx&dt=dj&q=%41%70%70%65%6E%64%69%6E%67%20%61%20%64%65%66%61%75%6C%74%20%31%31%31%20%63%61%72%64'
[[["افزودن یک کارت پیش\u200cفرض 111","Appending a default 111 card",null,null,3,null,null,[[]],[[["982c75c78c6c8e6005ec3a4021a7f785","tea_GrecoIndoEuropeA_en2elfahykakumksq_2021q3.md"]]]]],null,"en",null,null,null,1,[],[["en"],null,[1],["en"]]]

Notice the \u200c in the JSON output above, which is a "Zero Width Non-Joiner" Unicode character. When JSON::from_json parses the \u200c it blows up:

perl -Mutf8 -MJSON -e 'print from_json("[\"\\u200c\"]")->[0];'
Wide character in print at -e line 1.

I can "fix" it like this:

my $c = $res->content;
$c =~ s/\\u[0-9a-f]{4}//;
my $json = from_json($c);

and then the output text is correct (right-to-left):

افزودن یک کارت پیشفرض 111

Question: What is going on here?


Solution

  • There's a lot going on here. I think much of it, especially in the first two examples, stems from not understanding the difference between Perl's two kinds of strings: byte strings and Unicode character (code point) strings.

    Example 1 is a raw byte string: every \x{NN} value is below 256, the bytes happen to form valid UTF-8, and print passes them through unchanged. As long as the terminal displaying the output expects UTF-8, they render correctly. Example 2 contains a 'wide' character (\x{200c}, a value greater than 255), which makes the whole thing a Unicode string in which every \x{NN} is a code point, not a byte. The values that were meant as UTF-8 bytes, such as \x{d8}, are therefore treated as the characters U+00D8 and so on. Standard output is byte oriented and has no encoding layer, so print warns about the wide character and writes out Perl's internal UTF-8 form of the string, which encodes those characters a second time and produces mojibake.
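
    To make the distinction concrete, here is a minimal sketch using only the core Encode module. The variable names are made up for illustration, and the strings are shortened versions of the two examples (the first three letters of Example 1, and one letter plus the zero width non-joiner from Example 2):

    ```perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # Example 1 style: six values, all <= 0xFF, so Perl holds six bytes.
    # They happen to be the UTF-8 encoding of U+062E U+0637 U+0627.
    my $bytes = "\x{d8}\x{ae}\x{d8}\x{b7}\x{d8}\x{a7}";

    # The same text written as three code points (a character string).
    my $chars = "\x{62e}\x{637}\x{627}";

    # Example 2 style: UTF-8 byte values mixed with a real code point.
    my $mixed = "\x{d9}\x{81}\x{200c}";

    printf "bytes: length %d\n", length $bytes;
    printf "chars: length %d\n", length $chars;

    # Encoding the character string explicitly yields the byte string.
    my $encoded = encode('UTF-8', $chars);
    print "encoded chars eq bytes: ", ($encoded eq $bytes ? "yes" : "no"), "\n";

    # In $mixed, 0xD9 and 0x81 count as the characters U+00D9 and U+0081,
    # so encoding gives 2 + 2 + 3 = 7 bytes instead of the 5 bytes that
    # "\x{641}\x{200c}" would produce.
    printf "mixed: %d bytes after encoding\n", length encode('UTF-8', $mixed);

    # Already-encoded bytes can go to a raw filehandle without a warning.
    binmode STDOUT;
    print $encoded, "\n";
    ```

    The takeaway is to pick one representation per string: either code points that get encoded on output, or bytes that are passed through untouched.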

    As I suggested in a comment, reading perluniintro (and the other Unicode-related documentation) is a good start for learning how things work.


    But on to the actual task, extracting text from the JSON returned by your curl commands... I'd use jq instead if this is for a shell script:

    $ curl ... | jq -r '.[0][0][0]'
    افزودن یک کارت پیش‌فرض 111
    

    Compare to the equivalent perl one-liner:

    $ curl ... | perl -CS -MJSON -lne 'print from_json($_)->[0][0][0]'
    افزودن یک کارت پیش‌فرض 111
    

    The -CS option tells perl that standard input, output, and error are all UTF-8 encoded. You could also use -CO to apply that to standard output only, and then call decode_json() instead of from_json(); it expects raw UTF-8 encoded bytes rather than a Unicode string.

    In a script rather than a one-liner, the way to go is the OO interface to JSON, using its methods to configure how input strings are expected to be encoded, plus the open pragma (or binmode, or an encoding layer passed to open) instead of the -C option.
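
    A rough sketch of what such a script could look like follows. It makes a few assumptions: it uses the core JSON::PP module as a stand-in for JSON (the new, utf8 and decode methods shown exist in both), the JSON text is a shortened, hard-coded stand-in for the real response, and the variable names are invented for the example.

    ```perl
    use strict;
    use warnings;
    use open qw(:std :encoding(UTF-8));   # STDIN is decoded, STDOUT/STDERR are encoded
    use JSON::PP;                         # core module, stand-in for JSON
    use Encode qw(encode);

    # Shortened stand-in for the response: a character string containing
    # Farsi letters plus the \u200c escape as it appears in the JSON text.
    my $json_chars = qq([[["\x{67e}\x{6cc}\x{634}\\u200c\x{641}\x{631}\x{636} 111","Appending a default 111 card"]]]);

    # The same text as raw UTF-8 bytes, the form an undecoded download has.
    my $json_bytes = encode('UTF-8', $json_chars);

    # utf8 enabled: the decoder expects bytes (what decode_json does).
    my $from_bytes = JSON::PP->new->utf8->decode($json_bytes)->[0][0][0];

    # utf8 not enabled: the decoder expects characters (what from_json does).
    my $from_chars = JSON::PP->new->decode($json_chars)->[0][0][0];

    # Either way the result is a character string, and the encoding layer
    # on STDOUT turns it into UTF-8 on the way out, so print does not warn.
    print "same result: ", ($from_bytes eq $from_chars ? "yes" : "no"), "\n";
    print $from_chars, "\n";
    ```

    Whichever decoder setting is used, it has to match what the input really is; feeding bytes to the character decoder, or characters to the byte decoder, is what leads to double encoding or decode errors.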