Tags: python, image, python-3.x, utf-8, utf8-decode

Modifying non-text files via Python 3


I'm working on an encryption/decryption program, and I got it working on text files; however, I cannot open files in any other format. For example, if I do:

a_file = open('C:\Images\image.png', 'r', encoding='utf-8')
for a_line in a_file:
    print(a_line)

I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\WinPython-64bit-3.4.3.4\python-3.4.3.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 685, in runfile
    execfile(filename, namespace)
  File "C:\WinPython-64bit-3.4.3.4\python-3.4.3.amd64\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 85, in execfile
    exec(compile(open(filename, 'rb').read(), filename, 'exec'), namespace)
  File "C:/Comp_Sci/Coding/line_read_test.py", line 2, in <module>
    for a_line in a_file:
  File "C:\WinPython-64bit-3.4.3.4\python-3.4.3.amd64\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 0: invalid start byte

What am I doing terribly wrong?


Solution

  • Short version: You're opening binary files in text mode. Use 'rb' instead of 'r' (and drop the encoding parameter) and you'll be doing it right.

    Long version: Python 3 makes a very strict distinction between bytestrings and Unicode strings. The str type holds only Unicode text; each character of a str is a single Unicode codepoint. The bytes type, on the other hand, represents a series of 8-bit values that do not necessarily correspond to text. E.g., a .PNG file should be loaded as a bytes object, not as a str object. By passing the encoding="utf-8" parameter to open(), you're telling Python that your file contains only valid UTF-8 text, which a .PNG does not: its very first byte is 0x89, which can never start a UTF-8 sequence, and that is exactly what the UnicodeDecodeError reports. Instead, you should be opening the file as a binary file with 'rb' and not using any encoding. Then you'll get bytes objects rather than str objects when you read the file, and you'll need to treat them differently: indexing or iterating over a bytes object gives you integers from 0 to 255 rather than characters, reading "line by line" has no real meaning for an image, and the result should be written back with 'wb'.

    I see that @ignacio-vazquez-abrams has already posted good sample code while I've been typing this answer, so I won't duplicate his efforts. His code is correct: use it and you'll be fine.
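
    Since that other answer is not reproduced here, the following is only a minimal sketch of the binary-mode approach, not his code. It assumes a toy single-byte XOR as a stand-in for whatever cipher the real program uses; the names xor_bytes, transform_file and demo, the key 0x5A, and the temporary file names are all hypothetical. The point it illustrates is that the file is read with 'rb' and written with 'wb', with no encoding argument anywhere.

```python
import os
import tempfile

# The eight-byte signature that begins every PNG file. Its first byte, 0x89,
# is the one the 'utf-8' codec rejected in the traceback.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def xor_bytes(data, key):
    """Toy stand-in for a real cipher (assumption): XOR each byte with a one-byte key."""
    # Iterating over a bytes object yields integers in the range 0..255.
    return bytes(b ^ key for b in data)


def transform_file(src_path, dst_path, key):
    """Read src_path as raw bytes, transform them, and write them to dst_path."""
    with open(src_path, "rb") as src:   # 'rb': binary mode, no encoding argument
        data = src.read()               # a bytes object, not a str
    with open(dst_path, "wb") as dst:   # 'wb': write raw bytes, nothing is encoded
        dst.write(xor_bytes(data, key))


def demo():
    """Round-trip a fake image through the toy cipher and return (before, after)."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "image.png")
        encrypted = os.path.join(tmp, "image.enc")
        restored = os.path.join(tmp, "restored.png")

        # Hypothetical stand-in for a real image: the signature plus arbitrary bytes.
        with open(original, "wb") as f:
            f.write(PNG_SIGNATURE + bytes(range(256)))

        transform_file(original, encrypted, key=0x5A)
        transform_file(encrypted, restored, key=0x5A)  # XOR with the same key undoes it

        with open(original, "rb") as f:
            before = f.read()
        with open(restored, "rb") as f:
            after = f.read()
        return before, after


before, after = demo()
print(type(before).__name__, before[:4], before == after)
```

    A single-byte XOR is not real encryption; it is used here only because applying it twice restores the input, which makes it easy to check that every byte survives the read and write unchanged.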