[SOLVED] How to convert a Big5 encoded txt file to UTF8 encoded txt file?

I have a Big5-encoded file that Mac TextEdit can't open. How can I convert the whole file to UTF-8, which is far more widely supported?

I tried iconv in my terminal, but it fails, and searching for the error message turned up nothing useful.

$ iconv -f BIG5 -t UTF8 in.txt > out.txt
iconv: in.txt:5:0: cannot convert

Are there any other ways to convert?

I got the txt file from here, which is a list of Chinese names written in Traditional Chinese as used in Taiwan.

Solution

Looking at the first 20 lines of your file, it is clear that the encoding uses the byte 0x8C as the first byte of some multibyte sequences. The encodings that have this property are:

BIG5
BIG5-HKSCS
CP932
CP936
CP949
CP950
GB18030
GBK
JOHAB
Shift_JIS
Shift_JISX0213
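
One way to check which high bytes actually occur is to dump the first 20 lines as hex and count every byte at or above 0x80. This is only a rough sketch: the helper name high_bytes is made up for illustration, and the count mixes lead and trail bytes, so it helps narrow the candidates rather than prove anything.

```shell
# Hypothetical helper: histogram of bytes >= 0x80 in the first 20 lines of a file.
high_bytes() {
  head -n 20 "$1" |
    LC_ALL=C od -An -v -t x1 |       # hex dump, one pair per byte, no offsets
    tr -s ' ' '\n' |                 # put each byte on its own line
    grep -E '^[89a-f][0-9a-f]$' |    # keep only bytes with the high bit set
    sort | uniq -c | sort -rn        # most frequent first
}
```

Running `high_bytes unique_names_2012.txt` should list 8c among the results if the observation above holds for your copy of the file.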

Try each candidate encoding in turn:

$ for encoding in BIG5 BIG5-HKSCS CP932 CP936 CP949 CP950 GB18030 GBK \
                  JOHAB Shift_JIS Shift_JISX0213; do
    if head -n 20 unique_names_2012.txt | iconv -f "$encoding" -t UTF-8 > /dev/null 2>&1; then
      echo "$encoding"
    fi
  done

With GNU libiconv, it prints

BIG5-HKSCS
CP950
GB18030

Is it in GB18030 encoding?

$ iconv -f GB18030 < unique_names_2012.txt

shows hundreds of lines that use characters in the Unicode Private Use Area (PUA). That is not impossible, but it seems unlikely for a list of names.
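
To put a number on that, the converted output can be scanned for PUA code points. The sketch below is an assumption-laden illustration, not part of the original diagnosis: count_pua_lines is a hypothetical helper, it expects UTF-8 on standard input, and it only covers the BMP Private Use Area (U+E000 to U+F8FF), matched through its UTF-8 byte patterns.

```shell
# Hypothetical helper: count lines of UTF-8 text on stdin that contain at least
# one code point in the BMP Private Use Area (U+E000..U+F8FF).
# In UTF-8 these are EE xx xx (U+E000..U+EFFF) and EF 80..A3 xx (U+F000..U+F8FF).
count_pua_lines() {
  pattern=$(printf '\356|\357[\200-\243]')
  # LC_ALL=C makes grep match raw bytes; -a keeps it from treating input as binary.
  LC_ALL=C grep -a -c -E "$pattern"
}
```

For example, `iconv -f GB18030 -t UTF-8 < unique_names_2012.txt | count_pua_lines` prints how many lines are affected. Note that grep exits with status 1 when the count is 0.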

Is it in CP950 encoding?

$ iconv -f CP950 < unique_names_2012.txt

gives a conversion error at line 2294.

Is it in BIG5-HKSCS encoding?

$ iconv -f BIG5-HKSCS < unique_names_2012.txt

gives a conversion error at line 713.
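
If your iconv does not report a line number, the first failing line can be located by feeding the file to iconv one line at a time. This is a minimal sketch under stated assumptions: first_bad_line is a hypothetical helper, and it relies on newline bytes never occurring inside a multibyte sequence, which holds for the Big5 family. It starts one iconv process per line, so it is slow on large files.

```shell
# Hypothetical helper: print the number of the first line of FILE that iconv
# cannot convert from ENCODING to UTF-8; exit status 1 if every line converts.
# Usage: first_bad_line ENCODING FILE
first_bad_line() (
  LC_ALL=C; export LC_ALL    # read the input as raw bytes (local to this subshell)
  n=0
  while IFS= read -r line || [ -n "$line" ]; do
    n=$((n + 1))
    if ! printf '%s\n' "$line" | iconv -f "$1" -t UTF-8 > /dev/null 2>&1; then
      echo "$n"
      exit 0                 # leaves only the function's subshell
    fi
  done < "$2"
  exit 1
)
```

For example, `first_bad_line CP950 unique_names_2012.txt` prints the number of the first line that fails under CP950, and `sed -n 'Np' unique_names_2012.txt | od -An -t x1` (with N replaced by that number) shows the offending bytes.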

So the file is most probably encoded in a variant of BIG5. There are many such variants; see http://haible.de/bruno/charsets/conversion-tables/Chinese.html. It may be an extension of CP950 or of BIG5-HKSCS, since these are the most popular encodings in the BIG5 family today.

In summary, conversion errors like this are caused by the unstandardized proliferation of BIG5 variants.

The best thing you can do is to request the original file in UTF-8 encoding; let the originator deal with it.