
Why don't UTF-8 patterns work in Perl's substitute (s) and match (m) operators in one-liners?


I ran into this issue when using Perl one-liners to substitute UTF-8 text in files. I am aware of the workarounds in "How to handle utf8 on the command line (using Perl or Python)?", but they don't work in this case. The OS is Linux and the locale is set to UTF-8.

# create a file containing the pattern
$echo Текст на юникоде>file
$cat file
Текст на юникоде
# grep finds it too
$grep "Текст на юникоде" file
Текст на юникоде
# the Perl workarounds from the referenced question don't work:
$perl -C63 -n -e "print if m{Текст на юникоде}" file
# does not show anything
$perl -Mutf8 -n -e "print if m{Текст на юникоде}" file
# does not show anything
# although it handles parameters correctly
$perl -e 'print "$ARGV[0]\n"' "Текст на юникоде"
Текст на юникоде
# and inside -e options as well
$perl -e 'print "Текст на юникоде\n"'
Текст на юникоде
# a Perl script that looks for the pattern works:
$echo "while (<>) {print if m{Текст на юникоде}}">find.pl
$cat find.pl
while (<>) {print if m{Текст на юникоде}}
$perl find.pl file
Текст на юникоде
# and even this odd variant works:
$perl -ne '$m="Текст на юникоде";print if m{$m}' file
Текст на юникоде

So here is my question: is there a simpler way to use UTF-8 patterns with the m and s operators in Perl one-liners, and why doesn't the simple approach work?

Thank you!

Just in case:

$uname -a
Linux ubuntu16-pereval 4.4.0-190-generic #220-Ubuntu SMP Fri Aug 28 23:02:15 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
$locale
LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=en_US.UTF-8

Solution

    perl -C63 -n -e "print if m{Текст на юникоде}" file
    

    -C63 turns on Perl's UTF-8 I/O flags (63 = 1+2+4+8+16+32), which tell Perl that the standard streams, the default layers for input and output files, and @ARGV are all UTF-8.

    perl -Mutf8 -n -e "print if m{Текст на юникоде}" file
    

    -Mutf8 tells the Perl compiler that your source code is in UTF8.

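    -C63 affects how Perl sees the data in file. -Mutf8 affects how Perl sees the code in your -e option. With only one of them, one side is decoded into characters while the other is still raw bytes, so the pattern cannot match. In order for Perl to understand that the input file and the source code should both be interpreted as UTF8, you need both options.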

    $ perl -Mutf8 -C63 -n -e "print if m{Текст на юникоде}" file
    Текст на юникоде
    

    Update: Oh, and I should probably add that the simplest option works as well (but for all the wrong reasons!)

    $ perl -n -e "print if m{Текст на юникоде}" file
    Текст на юникоде
    

    In this case, it works because Perl interprets both the input and the source code as being made up of single-byte Latin-1 characters, so the raw bytes on each side happen to be identical. Please don't do this :-)