linux sed encoding shift-jis

gsed does not recognize SHIFT_JIS characters


I'm writing a program that uses gsed to extract multibyte characters from a CSV file.

It works well with a UTF-8 encoded CSV file, but it doesn't work with a SHIFT_JIS encoded one.

test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
test % cat sjis_sample.csv | iconv -f shift_jis -t utf-8 | gsed -r 's/"(.*)","(.*)"/\1 \2/'
こんにちは hello%
test % cat sjis_sample.csv | gsed -r 's/"(.*)","(.*)"/\1 \2/' | iconv -f shift_jis -t utf-8
"こんにちは","hello"%
LINE 1:
  Displays the file after converting it from SHIFT_JIS to UTF-8
LINE 2:
  Extracts the text contents after converting the encoding from SHIFT_JIS to UTF-8
  -> Works well
LINE 3:
  Extracts the text contents without converting the encoding first
  -> It seems that `gsed` fails to match the pattern, so the line passes through unchanged.

Does anybody know how to use gsed with a SHIFT_JIS encoded file?

Thank you.

% gsed --version
gsed (GNU sed) 4.8
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Jay Fenlason, Tom Lord, Ken Pizzini,
Paolo Bonzini, Jim Meyering, and Assaf Gordon.

This sed program was built without SELinux support.

GNU sed home page: <https://www.gnu.org/software/sed/>.
General help using GNU software: <https://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
test % locale
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=

Solved

Thanks to @KamilCuk

GNU sed is locale-aware. If you want to work with raw bytes (i.e. you can check which bytes represent " in Shift_JIS and feed those to sed), use:

LC_ALL=C sed ....

I set LANG to C instead of LC_ALL, because assigning LC_ALL=C in my script had no effect (details in the Appendix).

test % cat sjis_convert.sh
#!/bin/bash
LANG=C

cat sjis_sample.csv |\
  gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
  iconv -f shift_jis -t utf-8

test % ./sjis_convert.sh
こんにちは hello%
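
The quoted advice can also be applied to the sed stage alone: prefixing the command with `LC_ALL=C` places the variable in that one process's environment and leaves the locale of the rest of the script untouched. A minimal sketch, assuming `iconv` is installed; the sample line is recreated inline instead of being read from `sjis_sample.csv`, and the `SED` variable (a name chosen here, not from the original script) falls back to plain `sed` where `gsed` is not installed. `-E` is used in place of `-r`; GNU sed treats the two as equivalent.

```shell
#!/bin/bash
# Use gsed when available, otherwise fall back to sed (assumption for portability).
SED=$(command -v gsed || command -v sed)

# Recreate the sample line in Shift_JIS, run sed in the C locale so the
# pattern is matched byte by byte, then convert the result to UTF-8.
result=$(
  printf '%s' '"こんにちは","hello"' |
    iconv -f utf-8 -t shift_jis |
    LC_ALL=C "$SED" -E 's/"(.*)","(.*)"/\1 \2/' |
    iconv -f shift_jis -t utf-8
)

printf '%s\n' "$result"
```

Matching on `"` and `,` byte by byte should be safe here, because both are below 0x40 and the second byte of a double-byte Shift_JIS character starts at 0x40. A pattern containing characters such as `\` or `|` could match the second byte of a Japanese character, so it needs more care.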

Appendix

Assigning C to LC_ALL in the script had no effect.

test % cat sjis_convert.sh
#!/bin/bash
LC_ALL=C

locale

echo ''

cat sjis_sample.csv |\
  gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
  iconv -f shift_jis -t utf-8

echo ''

locale

test % ./sjis_convert.sh
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=

"こんにちは","hello"
LANG="ja_JP.UTF-8"
LC_COLLATE="ja_JP.UTF-8"
LC_CTYPE="ja_JP.UTF-8"
LC_MESSAGES="ja_JP.UTF-8"
LC_MONETARY="ja_JP.UTF-8"
LC_NUMERIC="ja_JP.UTF-8"
LC_TIME="ja_JP.UTF-8"
LC_ALL=

Instead, I set LANG to C, and it worked.

test % cat ./sjis_convert.sh
#!/bin/bash
LANG=C

locale

echo ''

cat sjis_sample.csv |\
  gsed -r 's/"(.*)","(.*)"/\1 \2/' |\
  iconv -f shift_jis -t utf-8

echo ''

locale

test % ./sjis_convert.sh
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

こんにちは hello
LANG="C"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=
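
The two runs above differ in whether the assignment reaches child processes. In bash, a plain `NAME=value` stays local to the shell unless `NAME` was already in the environment; `LANG` was, while `LC_ALL` (shown empty by `locale`) presumably was not. A minimal sketch of that behaviour, using `sh -c` as a stand-in for a child process such as `locale` or `gsed`; the names `seen_plain` and `seen_exported` are made up for this example.

```shell
#!/bin/bash
# Start from the situation in the transcript: LC_ALL is not in the environment.
unset LC_ALL

# Plain assignment: the variable exists in this shell only.
LC_ALL=C
seen_plain=$(sh -c 'printf "%s" "${LC_ALL-unset}"')

# Exported assignment: child processes inherit the value.
export LC_ALL=C
seen_exported=$(sh -c 'printf "%s" "${LC_ALL-unset}"')

printf 'plain: %s\nexported: %s\n' "$seen_plain" "$seen_exported"
```

If that is what happened here, writing `export LC_ALL=C` at the top of `sjis_convert.sh` should have the effect the plain assignment lacked.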

Solution

  • GNU sed is locale-aware. If you want to work with raw bytes (i.e. you can check which bytes represent " in Shift_JIS and feed those to sed), use:

    LC_ALL=C sed ....
    

    If you want to work with UTF-8, set a UTF-8 locale, which is most probably your default:

    LC_ALL=en_US.UTF-8 sed ...
    

    If you want to work with any other locale, pass it to sed the same way:

    LC_ALL=ja_JP.Shift_JIS sed ...
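
Locale names vary between systems, and a Shift_JIS locale is not always installed (on some systems it is spelled `ja_JP.SJIS`), so it may be worth checking `locale -a` before relying on the last form. A small sketch; the function name `find_sjis_locale` is made up for this example.

```shell
#!/bin/bash
# Print the first installed locale whose name mentions SJIS or Shift_JIS,
# or nothing if no such locale is installed.
find_sjis_locale() {
  locale -a 2>/dev/null | grep -i -E 'sjis|shift_?jis' | head -n 1
}

sjis_locale=$(find_sjis_locale)

if [ -n "$sjis_locale" ]; then
  echo "Shift_JIS locale available: $sjis_locale"
else
  echo "No Shift_JIS locale found; LC_ALL=C is the fallback"
fi
```

When a name is found, it can be passed as `LC_ALL="$sjis_locale" sed ...`; otherwise the byte-oriented `LC_ALL=C` form remains available.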