I'm writing a program that uses `gsed` to extract multibyte characters from a CSV file.
It works well with a UTF-8 encoded CSV file, but not with one encoded in Shift_JIS.
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8 | gsed -r 's/"(.*)","(.*)"/\1 \2/'
こんにちは hello%
test % cat sjis_sample.csv | gsed -r 's/"(.*)","(.*)"/\1 \2/' | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
LINE 1:
Read the file, converted to UTF-8
LINE 2:
Extract the text contents from the CSV file after converting the encoding from Shift_JIS to UTF-8
-> Works well
LINE 3:
Extract the text contents from the CSV file without converting the encoding first
-> It seems that `gsed` fails to capture the text contents with the match pattern.
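For anyone who wants to reproduce this without the original file, here is a minimal sketch that recreates a file like `sjis_sample.csv` from raw bytes. The byte values are an assumption: they are the standard Shift_JIS encoding of こんにちは, not a dump of the actual file, and the temporary path is made up for illustration.

```shell
# Hypothetical reproduction: build the sample from raw bytes (octal escapes).
# Assumed Shift_JIS bytes for こんにちは: 82 B1 82 F1 82 C9 82 BF 82 CD
tmpdir=$(mktemp -d)
sample="$tmpdir/sjis_sample.csv"
printf '"\202\261\202\361\202\311\202\277\202\315","hello"' > "$sample"

# Count the bytes and dump them in hex to confirm what was written.
size=$(wc -c < "$sample" | tr -d ' ')
echo "bytes: $size"
od -An -tx1 "$sample"
```

Like the original, the file has no trailing newline, which is why the shell prints `%` after the output.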
Does anybody know how to use `gsed` with a Shift_JIS encoded file?
Thank you.
% gsed --version
gsed (GNU sed) 4.8
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Jay Fenlason, Tom Lord, Ken Pizzini,
Paolo Bonzini, Jim Meyering, and Assaf Gordon.
This sed program was built without SELinux support.
GNU sed home page: <https://www.gnu.org/software/sed/>.
General help using GNU software: <https://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
test % locale
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
Thanks to @KamilCuk
GNU `sed` is locale aware. If you want to work with raw bytes (i.e. you can check what bytes represent `"` in Shift_JIS and feed that to `sed`), use:
LC_ALL=C sed ....
I set `LANG` to `C` instead of `LC_ALL`, because I could not set `LC_ALL` to `C`.
test % cat sjis_convert.sh
#!/bin/bash
LANG=C
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
test % ./sjis_convert.sh
こんにちは hello%
I could not set `LC_ALL` to `C`; the `locale` output stays unchanged.
test % cat sjis_convert.sh
#!/bin/bash
LC_ALL=C
locale
echo ''
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
echo ''
locale
test % ./sjis_convert.sh
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
"こんにちは","hello"
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=
Instead, I set `LANG` to `C`, and it worked.
test % cat ./sjis_convert.sh
#!/bin/bash
LANG=C
locale
echo ''
cat sjis_sample.csv |\
gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
iconv -f shift_jis -t utf-8
echo ''
locale
test % ./sjis_convert.sh
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
こんにちは hello
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
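A possible explanation for the difference, offered as a sketch rather than a diagnosis of this exact setup: in bash, a plain `LC_ALL=C` assignment creates a shell variable that is not passed to child processes unless it is exported, whereas `LANG` was presumably already exported by the login environment, so reassigning it did reach `locale` and `gsed`. The `sh -c` probe below is only an illustrative helper.

```shell
#!/bin/bash
# Probe: print the LC_ALL value a child process sees, or "unset".
probe='printf %s "${LC_ALL:-unset}"'

unset LC_ALL                          # mimic the starting state (LC_ALL empty)
LC_ALL=C                              # plain assignment: not exported
plain=$(sh -c "$probe")

unset LC_ALL
prefixed=$(LC_ALL=C sh -c "$probe")   # per-command prefix: set for that command only

export LC_ALL=C                       # exported: visible to every later child process
exported=$(sh -c "$probe")

echo "plain=$plain prefixed=$prefixed exported=$exported"
# prints: plain=unset prefixed=C exported=C
```

If that is what happened here, either `export LC_ALL=C` at the top of the script or `LC_ALL=C gsed ...` on the command itself should behave like the `LANG=C` version.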
GNU `sed` is locale aware. If you want to work with raw bytes (i.e. you can check what bytes represent `"` in Shift_JIS and feed that to `sed`), use:
LC_ALL=C sed ....
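Applied to the Shift_JIS data from the question, that might look like the sketch below. It assumes the sample bytes are the standard Shift_JIS encoding of こんにちは, writes to a made-up temporary file, and calls `sed -E` (use `gsed` where GNU sed is installed under that name).

```shell
# Assumed Shift_JIS bytes for "こんにちは","hello" (octal escapes).
tmpdir=$(mktemp -d)
printf '"\202\261\202\361\202\311\202\277\202\315","hello"\n' > "$tmpdir/sjis_demo.csv"

# In the C locale every byte is one character, so .* can span the 0x82.. pairs.
LC_ALL=C sed -E 's/"(.*)","(.*)"/\1 \2/' "$tmpdir/sjis_demo.csv" > "$tmpdir/out.txt"

# Show the result as hex; pipe out.txt through `iconv -f shift_jis -t utf-8` to read it.
out_hex=$(od -An -tx1 "$tmpdir/out.txt" | tr -d ' \n')
echo "$out_hex"
```

This works for this pattern because ASCII `"` (0x22) and `,` (0x2C) cannot occur as the second byte of a Shift_JIS character. Bytes such as `\` (0x5C) can, so patterns that contain those would need more care.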
If you want to work with UTF-8, set a UTF-8 locale, which is most probably your default:
LC_ALL=en_US.UTF-8 sed ...
And if you want to work with any other locale, tell it to sed:
LC_ALL=ja_JP.Shift_JIS sed ...
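Locale names vary between systems, and a Shift_JIS locale may not be installed at all, so it may help to check before relying on one. Below is a small sketch using `locale -a`; the `grep` pattern is only a guess at common spellings such as `ja_JP.SJIS`.

```shell
# List installed locales whose names look like Shift_JIS (spelling varies by system).
sjis_locales=$(locale -a 2>/dev/null | grep -i -E 'sjis|shift_jis' || true)

if [ -n "$sjis_locales" ]; then
  msg="Shift_JIS locales found: $(printf '%s' "$sjis_locales" | tr '\n' ' ')"
else
  msg="no Shift_JIS locale found; LC_ALL=C plus iconv may be the safer route"
fi
echo "$msg"
```

Whatever name the list shows is the one to pass as `LC_ALL=...` in front of `sed`.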