I was trying to set up a bash script that uses the following code. Odd path names were getting split when I used a simple separator such as ':', so I switched to the SECTION SIGN (§, U+00A7), which bash lets me write as $'\u0A7':
while IFS= read -r -d '' LINE ; do
some actions ;
done < <( find * -type f -printf "%p"$'\u0A7'"%s\0" ; )
But now I've given myself another headache: cut and csvtool don't work on this Unicode character. (Well, csvtool does split, but it leaves a trailing REPLACEMENT CHARACTER on the first column's string.) I thought I'd replace csvtool with a bit of inline Perl code, e.g.
perl -n -e 'chomp ; @sx = split(/\N{SECTION SIGN}/,$_) ; print "$sx[0] :: $sx[1]\n" ; ' <<< 12345$'\u0A7'abcdef
12345� :: abcdef
This is the same result as with csvtool. The � is the REPLACEMENT CHARACTER, and I expect it comes from leaving behind part of the character's 2-byte UTF-8 encoding.
So, my question (and I've tried several things until my brain hurt) is: what Unicode encoding/decoding do I need to add to get this Perl code to return the correct split strings?
The problem has two parts. The first is that Perl, for compatibility reasons, doesn't enable "Unicode mode" on its input and output by default; they are assumed to be in some 8-bit encoding instead.
The second part is that the section sign, U+00A7, happens to have the UTF-8 encoding 0xC2 0xA7. See what happened there? Due to the (not entirely logical) way character names work in non-Unicode mode, you ended up splitting on the byte 0xA7, which happens to be in your input in roughly the right place, but that leaves behind a lonely 0xC2. Then, when something tries to decode your final output as UTF-8, it sees an illegal sequence 0xC2 0x20 (the 0x20 is the space in " :: ") and spits out a replacement character.
The solution is simply to add -C to your Perl flags. As long as you have a Perl from the past 15 years or so, that will tell it to treat stdin and stdout as UTF-8 as long as the prevailing locale appears to support that, and so it will split on the entire character instead of mangling it.
perl -C -n -e 'chomp ; @sx = split(/\N{SECTION SIGN}/,$_) ; print "$sx[0] :: $sx[1]\n" ; ' <<< 12345$'\u0A7'abcdef
12345 :: abcdef