I am using a Python script to convert files from gb2312 to utf-8. This character messes everything up: ㎜ (it is one symbol, not "mm").
text = '㎜'
text.encode(encoding='gb2312')
raises this error:
UnicodeEncodeError: 'gb2312' codec can't encode character '\u339c' in position 0: illegal multibyte sequence
I can work around it with text.replace('㎜', 'mm'). But what if there are other such characters? What is wrong with this one? Why is it so special?
Is there a way to make Python treat it like any other character?
OK, so, I downloaded the file 1.php and ran your original script on it, and I get a different error message:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 99-100:
illegal multibyte sequence
The bytes in the file at offsets 99 and 100 are A9 4C, in that order. That is neither valid GB2312 nor a valid UTF-8 encoding of anything. I suspect you may have a whole bunch of files that are supposedly GB2312 but are actually in some other encoding. If you just need to bull through all such problems, you can use errors='replace'. (You don't need a special mode for your DOS newlines: Python 3 text mode already translates them on reading, and the old 'U' mode flag was removed in Python 3.11.)
file_old=open('1.php', mode='r', encoding='gb2312', errors='replace')
This will insert U+FFFD REPLACEMENT CHARACTER in place of anything it can't decode, and continue. This destroys data; first try to figure out what the real encoding of the file is.
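Before resorting to errors='replace', it can help to see exactly which bytes the codec rejects; the exception itself carries the offsets. A minimal sketch, assuming the data is read as bytes first. The helper name find_bad_bytes and the sample bytes are made up for illustration:

```python
def find_bad_bytes(data, encoding):
    """Return (start, end, offending_bytes) for the first decode error, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as exc:
        # exc.start and exc.end delimit the slice the codec rejected
        return exc.start, exc.end, exc.object[exc.start:exc.end]
    return None


# Hypothetical sample: ASCII text with the two bytes A9 4C in the middle
sample = b'width: 5' + b'\xa9\x4c' + b';\r\n'

print(find_bad_bytes(sample, 'gb2312'))

# errors='replace' substitutes U+FFFD for what it cannot decode and carries on
lossy = sample.decode('gb2312', errors='replace')
print('\ufffd' in lossy)
```

How many bytes are reported as bad (one or two) can differ between Python versions, so the sketch only relies on where the error starts.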
By the way, don't forget to fix up your HTML header when you're done; the preferred form nowadays is
<!doctype html>
<html><head>
<meta charset="utf-8">
Concise, standards-compliant, and tested to work all the way back to IE6.
EDIT: On further investigation, GB2312 is a character set, not an encoding. There are several possible encodings of it, but none of them allows the two-byte sequence A9 4C. Big5, a different Chinese encoding, does: there it corresponds to the character 呶. (I do not know any of the languages that use Chinese characters; does that make more sense in context than ㎜?)
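One way to check a guess like this is to decode the two bytes under each candidate codec and see which ones accept them. A small sketch; the candidate list is just the encodings mentioned so far, and try_decodings is a made-up helper name:

```python
def try_decodings(raw, encodings):
    """Map each encoding name to the decoded text, or None if decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


suspect = b'\xa9\x4c'  # the bytes at offsets 99 and 100
for name, text in try_decodings(suspect, ['gb2312', 'utf-8', 'big5']).items():
    # ascii() keeps the output printable on consoles that lack CJK fonts
    print(name, 'rejects the bytes' if text is None else 'decodes them as ' + ascii(text))
```

A codec accepting the bytes only shows that they are valid in that encoding, not that it is the encoding the file was written in.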
Python and iconv assume that GB2312 is encoded as EUC-CN unless specifically told otherwise. If I modify your script to read
file_old=open('1.php', mode='r', encoding='big5', errors='strict')
file_new=open('2.php', mode='w', encoding='utf-8')
file_new.write(file_old.read())
then it executes without error on the 1.php you provided.
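The same conversion can be wrapped in a small function that closes both files even if decoding fails. This is only a sketch: convert_file and its parameters are names invented here, and the source encoding is left as an argument because it is still a guess at this point.

```python
def convert_file(src_path, dst_path, src_encoding, dst_encoding='utf-8'):
    """Re-encode a text file, raising UnicodeDecodeError on undecodable bytes."""
    # newline='' passes line endings through untranslated, so DOS newlines
    # in the source stay DOS newlines in the output
    with open(src_path, mode='r', encoding=src_encoding,
              errors='strict', newline='') as src:
        text = src.read()
    with open(dst_path, mode='w', encoding=dst_encoding, newline='') as dst:
        dst.write(text)
```

Calling convert_file('1.php', '2.php', 'big5') would do the same job as the three lines above, except that it leaves line endings untouched instead of translating them.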
EDIT 2: On further further investigation, what web browsers do with <meta charset="gb2312"> is pretend you wrote <meta charset="gbk">. GBK is a superset of GB2312 that does include the ㎜ character. Python, however, treats GB2312 per its original definition. So what you really want in order for your conversion to match the original file is
file_old=open('1.php', mode='r', encoding='gbk', errors='strict')
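As for whether other characters are in the same situation as ㎜: one way to find out is to try encoding each character under both codecs. A sketch using only str.encode; gbk_only_chars is a name made up for this example, and the sample string is arbitrary:

```python
def gbk_only_chars(text):
    """Return the characters of text that gbk can encode but gb2312 cannot."""
    found = []
    for ch in text:
        try:
            ch.encode('gb2312')
        except UnicodeEncodeError:
            try:
                ch.encode('gbk')
            except UnicodeEncodeError:
                continue  # not in GBK either
            found.append(ch)
    return found


# ascii() keeps the output printable on any console
print([ascii(ch) for ch in gbk_only_chars('5㎜ 宽')])
print('㎜'.encode('gbk'))
```

Any character this reports will trip Python's gb2312 codec but convert cleanly once the file is opened as gbk.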