python unicode encoding python-3.x gb2312

This character - ㎜ - raises a UnicodeEncodeError


I am using a Python script to convert files from gb2312 to utf-8. This character messes everything up: ㎜ (it is one symbol, not "mm").

text = '㎜'
text.encode(encoding='gb2312')

raises this error:

UnicodeEncodeError: 'gb2312' codec can't encode character '\u339c' in position 0: illegal multibyte sequence

I can work around it with text.replace('㎜', 'mm'). But what if there are other such characters? What is wrong with it? Why is it so special?

Is there a way to make Python treat it as any other character?
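
For the "what if there are others?" part, one rough way to find out is to scan the text and collect every character the target codec rejects. This is only a minimal sketch: the helper name `unencodable_chars` and the sample string are made up for illustration and are not part of the original script.

```python
def unencodable_chars(text, encoding):
    """Return the distinct characters in `text` that `encoding` cannot encode.

    Hypothetical helper for illustration; it tries each character on its own
    and records the ones that raise UnicodeEncodeError, in order of appearance.
    """
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


# Sample text is an assumption: ASCII, the troublesome symbol, and two
# ordinary Chinese characters.
print(unencodable_chars("10 ㎜ 中文", "gb2312"))
```

Running this over the decoded contents of a file gives a list of characters that would need special handling before encoding to that codec.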


Solution

  • OK, so, I downloaded the file 1.php and ran your original script on it, and I got a different error message:

    UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 99-100:
      illegal multibyte sequence
    

    The bytes in the file at offsets 99 and 100 are A9 4C, in that order. That is neither a valid GB2312 nor a valid UTF-8 encoding of anything. I suspect you have a whole bunch of files that are supposedly GB2312 but are actually in some other encoding. If you just need to bull through all such problems, you can use errors='replace' and mode='rU' (the latter makes Python understand your DOS newlines; Python 3 text mode already does this by default, and the 'U' flag was removed in Python 3.11, where plain 'r' must be used instead).

    file_old=open('1.php', mode='rU', encoding='gb2312', errors='replace')
    

    This will insert U+FFFD REPLACEMENT CHARACTER in place of anything it can't decode, and continue. This destroys data; first try to figure out what the real encoding of the file is.
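
One way to start figuring out the real encoding is to try a strict decode with a few candidate codecs and see where each one first fails. The sketch below is an assumption-laden illustration: the function name `probe_encodings` and the candidate list are invented here, and a codec that decodes without error is only a candidate, not proof, since several encodings can accept the same bytes.

```python
def probe_encodings(data, candidates=("gb2312", "gbk", "gb18030", "big5", "utf-8")):
    """Try a strict decode of `data` with each candidate codec.

    Returns a dict mapping codec name to None if the decode succeeded, or to
    (offset, hex of the offending bytes) for the first failure.
    Hypothetical helper; the candidate list is only a starting guess.
    """
    report = {}
    for name in candidates:
        try:
            data.decode(name, errors="strict")
        except UnicodeDecodeError as exc:
            report[name] = (exc.start, data[exc.start:exc.end].hex())
        else:
            report[name] = None
    return report


# In practice the bytes would come from open('1.php', 'rb').read();
# here a tiny in-memory sample stands in for the file.
sample = "a㎜b".encode("gbk")
for codec, failure in probe_encodings(sample).items():
    print(codec, "ok" if failure is None else failure)
```

Reading the file in binary mode for this check avoids committing to any encoding before the probe has run.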

    By the way, don't forget to fix up your HTML header when you're done; the preferred form nowadays is

    <!doctype html>
    <html><head>
      <meta charset="utf-8">
    

    Concise, standards-compliant, and tested to work all the way back to IE6.

    EDIT: On further investigation, GB2312 is a character set, not an encoding. There are several possible encodings of it, but only one allows the two-byte sequence A9 4C: in Big5, it corresponds to a valid character. (I do not know any of the languages that use Chinese characters; does that character make more sense in context than ㎜?)

    Python and iconv assume that GB2312 is encoded in a different format, EUC-CN, unless specifically told otherwise. If I modify your script to read

    file_old=open('1.php', mode='rU', encoding='big5', errors='strict')
    file_new=open('2.php', mode='w', encoding='utf-8')
    file_new.write(file_old.read())
    

    then it executes without error on the 1.php you provided.

    EDIT 2: On further further investigation, what web browsers do with <meta charset="gb2312"> is pretend you wrote <meta charset="gbk">. GBK is a superset of GB2312 that does include the character. Python, however, treats GB2312 per its original definition. So what you really want in order for your conversion to match the original file is

    file_old=open('1.php', mode='rU', encoding='gbk', errors='strict')
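
Putting the final version together, the whole conversion might be sketched roughly as follows. The function name `convert_to_utf8` and its default of 'gbk' are assumptions for illustration. Passing newline='' is a design choice of this sketch (it keeps the original line endings unchanged) and is not something the script above does.

```python
def convert_to_utf8(src, dst, source_encoding="gbk"):
    """Re-encode a text file to UTF-8, failing loudly on undecodable bytes.

    Hypothetical helper: reads `src` strictly as `source_encoding` and writes
    the same text to `dst` as UTF-8. newline="" disables newline translation
    on both sides, so DOS line endings survive unchanged.
    """
    with open(src, "r", encoding=source_encoding, errors="strict", newline="") as f_in:
        text = f_in.read()
    with open(dst, "w", encoding="utf-8", newline="") as f_out:
        f_out.write(text)
    return text


# Quick in-memory check that the symbol from the question round-trips
# through GBK; the printed hex can be compared with the bytes found
# at offsets 99 and 100 of the file.
encoded = "㎜".encode("gbk")
print(encoded.hex(), encoded.decode("gbk"))
```

A call such as convert_to_utf8('1.php', '2.php') would then mirror the script above; errors='strict' means a file in some other encoding raises instead of being silently damaged.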