I have a Big5-encoded file that Mac TextEdit can't open. I would like to convert the whole file to UTF-8, since UTF-8 is much more universal and widely supported.
I have tried using iconv in my terminal, but it does not work, and I can't find anything useful about this error on Google either.
$ iconv -f BIG5 -t UTF8 in.txt > out.txt
iconv: in.txt:5:0: cannot convert
Are there any other ways to convert?
I got the txt file from here, which is a list of Chinese names written in Traditional Chinese as used in Taiwan.
Looking at the first 20 lines of your file, it is clear that the encoding uses the byte 0x8C as the first byte of some multibyte sequences. The encodings that have this property are BIG5, BIG5-HKSCS, CP932, CP936, CP949, CP950, GB18030, GBK, JOHAB, Shift_JIS and Shift_JISX0213.
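If you want to double-check that observation yourself, one way (just a sketch; it relies on hexdump -C separating bytes with spaces, and uses the same file name as the commands below) is to hex-dump the first 20 lines and look for 0x8C bytes:
$ head -n 20 unique_names_2012.txt | hexdump -C | grep ' 8c '   # rows containing a 0x8C byte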
Try them in turn:
$ for encoding in BIG5 BIG5-HKSCS CP932 CP936 CP949 CP950 GB18030 GBK \
      JOHAB Shift_JIS Shift_JISX0213; do \
    if head -n 20 < unique_names_2012.txt | iconv -f $encoding -t UTF-8 > /dev/null 2> /dev/null; then \
      echo $encoding; \
    fi; \
  done
With GNU libiconv, this prints:
BIG5-HKSCS
CP950
GB18030
Is it in GB18030 encoding?
$ iconv -f GB18030 < unique_names_2012.txt
shows hundreds of lines that use characters in the PUA (Private Use Area) range. While not impossible, it seems unlikely.
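If you want to reproduce that count, a rough sketch (assuming a Perl interpreter with Unicode support is available; the BMP Private Use Area is U+E000..U+F8FF) is:
$ iconv -f GB18030 < unique_names_2012.txt \
    | perl -CSD -ne 'print if /[\x{E000}-\x{F8FF}]/' \
    | wc -l   # number of lines containing at least one PUA character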
Is it in CP950 encoding?
$ iconv -f CP950 < unique_names_2012.txt
gives a conversion error at line 2294.
Is it in BIG5-HKSCS encoding?
$ iconv -f BIG5-HKSCS < unique_names_2012.txt
gives a conversion error at line 713.
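To see exactly which byte sequences the converters choke on, you can hex-dump the two offending lines reported above (a sketch; sed -n 'Np' prints only line N of the file):
$ sed -n '2294p' unique_names_2012.txt | hexdump -C   # bytes that CP950 rejects
$ sed -n '713p'  unique_names_2012.txt | hexdump -C   # bytes that BIG5-HKSCS rejects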
So, most probably the file is encoded in a variant of BIG5. There are many such variants, see http://haible.de/bruno/charsets/conversion-tables/Chinese.html. Possibly it's an extension of CP950 or an extension of BIG5-HKSCS (since these are the most popular encodings from the BIG5 family today).
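If you need a usable UTF-8 file right away and can tolerate losing the few names that rely on non-standard extensions, iconv's -c option (omit characters that cannot be converted) gives a lossy but otherwise clean conversion. This is only a stopgap, not a fix; the choice of BIG5-HKSCS here is arbitrary (CP950 works the same way), and the output file name is just an example:
$ iconv -c -f BIG5-HKSCS -t UTF-8 < unique_names_2012.txt > unique_names_2012.utf8.txt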
In summary, such conversion errors are caused by the unstandardized proliferation of BIG5 variants.
The best thing you can do is to request the original file in UTF-8 encoding; let the originator deal with it.