Tags: python, encoding, utf-8, cp1252

Which encoding does Python's open() function use?


I'm getting an exception when reading a file that contains a RIGHT DOUBLE QUOTATION MARK Unicode symbol. It is encoded in UTF-8 (0xE2 0x80 0x9D). The minimal example:

import sys

print(sys.getdefaultencoding())

f = open("input.txt", "r")
f.readline()

This script fails while reading the first line, even if the right quotation mark is not on the first line. The exception looks like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 102: character maps to <undefined>

The input file is encoded in UTF-8; I've tried it both with and without a BOM. The default encoding returned by sys.getdefaultencoding() is utf-8.

This script fails on the machine with Python 3.6.5 but works well on another with Python 3.6.0. Both machines are Windows.

My questions are mostly theoretical, as this exception is thrown from external software that I cannot change, and it reads a file that I don't wish to change. What could differ between these machines besides the Python patch version? Why does plain open() use cp1252 if the system default is utf-8?


Solution

  • As stated in the documentation for Python's open():

    In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding.

    Windows defaults to a localized encoding (cp1252 on US and Western European versions). Linux typically defaults to utf-8.

    Because it is platform-dependent, use the encoding parameter and specify the encoding of the file explicitly.
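The difference between the two settings can be seen in a minimal sketch like the one below. It assumes a throwaway file created in a temporary directory (the path and the sample text are made up for illustration, not taken from the question). Note that sys.getdefaultencoding() is a separate setting from the one open() consults: in Python 3 it always reports utf-8, while open() in text mode falls back to locale.getpreferredencoding(False).

```python
import locale
import os
import sys
import tempfile

# Two different settings that are easy to conflate:
print(sys.getdefaultencoding())            # default for str.encode() / bytes.decode()
print(locale.getpreferredencoding(False))  # what open() uses when encoding is omitted

# Hypothetical sample file: UTF-8 text containing U+201D (bytes E2 80 9D).
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "wb") as f:
    f.write("He said \u201dhello\u201d\n".encode("utf-8"))

# Passing the encoding explicitly gives the same result on every platform.
with open(path, "r", encoding="utf-8") as f:
    first_line = f.readline()

# Decoding the same bytes as cp1252 fails, because byte 0x9d has no
# mapping in Python's cp1252 codec.
with open(path, "rb") as f:
    raw = f.read()
try:
    raw.decode("cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

print(repr(first_line), cp1252_failed)
```

If the file may start with a BOM, encoding="utf-8-sig" decodes it the same way and strips the BOM when present.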