
Why does `toml.load(f)` fail with this file under Windows (but not on Linux)?


I have a TOML file which I want to process with this script.

This used to work fine under Linux. Under Windows (Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 13:23:52) [MSC v.1900 32 bit (Intel)] on win32) I get the following error:

Need to process 1 file(s)
Processing file test01.toml (1 of 1)
Traceback (most recent call last):
  File "py/process.py", line 27, in <module>
    add_text_fragment(input_dir + "/" + file)
  File "<string>", line 10, in add_text_fragment
  File "C:\Users\1\Anaconda3\lib\site-packages\toml\decoder.py", line 134, in load
    return loads(f.read(), _dict, decoder)
  File "C:\Users\1\Anaconda3\lib\encodings\cp1251.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x98 in position 985: character maps to <undefined>

I assume that the error happens somewhere here:

f = open(toml_file_name, "r")
pt = toml.load(f)
f.close()

According to Notepad++, the file in question is encoded as UTF-8.

How can I fix it?

Clarification

I want to make sure that the script process.py processes the input file correctly, i.e. that execution gets past the comment starting with "If at this point pt" in addTextFragment.py

def add_text_fragment(toml_file_name):
    f = open(toml_file_name, "r")
    pt = toml.load(f)
    f.close()
    
    # If at this point pt contains the data of the input file,
    # then you have attained the goal.
    if (pt["type"] == "TA"):

and the variable pt contains the data from the input file.

I'm aiming to solve this on Windows 10, Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32.

Note: process.py executes addTextFragment.py for all files in a particular directory.


Solution

  • Just replace this line:

    f = open(toml_file_name, "r")
    

    with:

    f = open(toml_file_name, "r", encoding="utf-8")
    

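    As the traceback shows (it passes through encodings\cp1251.py), Python is decoding the file with the system's default encoding, because open() was called without an encoding argument. On this Windows machine the default is cp1251, in which byte 0x98 is undefined, hence the UnicodeDecodeError. The file contains non-ASCII characters and is saved as UTF-8; it worked on Linux because the default encoding there is usually UTF-8.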
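
The effect can be reproduced without the toml package. The following is only a minimal sketch: the sample content, the helper names read_text and demo, and the temporary file are made up for illustration and are not part of the original script. It writes a UTF-8 file containing the Cyrillic letter "И" (encoded as bytes 0xD0 0x98), then reads it back once as cp1251 and once as UTF-8.

```python
import os
import tempfile


def read_text(path, encoding):
    """Read a whole text file using an explicitly chosen encoding."""
    # The with-statement closes the file even if decoding raises.
    with open(path, "r", encoding=encoding) as f:
        return f.read()


def demo():
    # Hypothetical sample content; "И" is U+0418, which UTF-8 encodes
    # as the two bytes 0xD0 0x98. Byte 0x98 is undefined in cp1251.
    sample = 'type = "TA"\ntitle = "И"\n'

    fd, path = tempfile.mkstemp(suffix=".toml")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(sample)

        # Decoding with cp1251 mimics open(path, "r") on a Windows
        # machine whose default code page is cp1251.
        try:
            read_text(path, "cp1251")
            cp1251_failed = False
        except UnicodeDecodeError:
            cp1251_failed = True

        # Decoding with the encoding the file was actually saved in.
        utf8_text = read_text(path, "utf-8")
    finally:
        os.remove(path)

    return cp1251_failed, utf8_text


if __name__ == "__main__":
    failed, text = demo()
    print("cp1251 read failed:", failed)
    print("utf-8 read contains the Cyrillic letter:", "И" in text)
```

Passing encoding="utf-8" explicitly makes the result independent of the machine's locale, so the same file object can then be handed to toml.load(f) on both Windows and Linux.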