python unicode python-importlib

Why does importlib.resources read_text raise UnicodeDecodeError on Windows but not mac and RPi?


I am writing a small web app (https://github.com/r-owen/base_loom_server) and it runs on macOS and Raspberry Pi, but a Windows user reports an exception when Python reads a UTF-8-encoded resource file using importlib.resources.files.read_text() with no arguments. The resource is JavaScript, but I doubt it matters. This is the error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 20834: character maps to <undefined>

I believe it is complaining about the first of four arrow characters in that file: "←" = LEFTWARDS ARROW. The file has the same style of arrow in all 4 directions: U+2190 through U+2193. Those are the only non-ASCII characters in the file (other than a few unnecessary • in a comment earlier in the file, which I will replace with *).

As I understand it, with no arguments read_text decodes using UTF-8 in strict mode, which is what I want.

I would like to understand what's going on so I can work around this, and to learn about any potential problems with letters that have diacritical marks (because the app also uses read_text to load language translation files as needed).

The failure occurs here


Solution

  • importlib.resources.files.read_text is not the same as importlib.resources.read_text.

    The Traversable object returned by importlib.resources.files has the method read_text(encoding=None), while you seem to be referring to the function importlib.resources.read_text(anchor, *path_names, encoding='utf-8', errors='strict').

    In my testing, encoding=None (the default) fell back to the locale's preferred encoding, which on my Windows system is Windows-1252 (it varies by locale). Pass encoding='utf8' explicitly instead.

    My test consisted of a resources subdirectory containing chinese.txt encoded in UTF-8 and spanish.txt encoded in Windows-1252:

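    import importlib.resources
    
    # 'resources' is a subdirectory holding the two test files
    res = importlib.resources.files('resources')
    zh = res.joinpath('chinese.txt')  # UTF-8 encoded
    sp = res.joinpath('spanish.txt')  # Windows-1252 encoded
    
    # Default encoding=None uses the locale encoding, so UTF-8 text can fail
    try:
        print(zh.read_text())
    except UnicodeDecodeError as e:
        print(e)
    # An explicit encoding decodes the UTF-8 file correctly
    print(zh.read_text(encoding='utf8'))
    
    print(sp.read_text())  # default happens to match this file's encoding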
    

    Output (note 0x90 is an undefined byte in Windows-1252 encoding):

    'charmap' codec can't decode byte 0x90 in position 7: character maps to <undefined>
    你好吗
    
    ¿Qué pasa?
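
As for the arrows and the diacritical marks, here is a minimal sketch of what happens at the byte level when UTF-8 data is decoded as Windows-1252. The sample strings are made up for illustration, and "cp1252" is Python's codec name for Windows-1252. It suggests that the 0x90 in your error is the last byte of the UTF-8 form of ←, and that many accented letters will not raise at all but will silently turn into the wrong characters:

```python
# Illustrative only: the sample strings are assumptions, not from the app.
arrow = "\u2190"        # LEFTWARDS ARROW, the character from the question
accented = "Qu\u00e9"   # "Qué", a word with a diacritical mark

arrow_bytes = arrow.encode("utf-8")
print(arrow_bytes)      # three bytes; the last one is 0x90

# Decoding those UTF-8 bytes as Windows-1252 fails on 0x90,
# which Python's cp1252 codec leaves undefined.
try:
    arrow_bytes.decode("cp1252")
    arrow_error = None
except UnicodeDecodeError as exc:
    arrow_error = exc
print(arrow_error)

# This accented letter does not raise; it decodes into two wrong characters.
garbled = accented.encode("utf-8").decode("cp1252")
print(garbled)

# Decoding with the encoding the data was written in round-trips both strings.
assert arrow_bytes.decode("utf-8") == arrow
assert accented.encode("utf-8").decode("utf-8") == accented
```

So for translation files the missing encoding argument is arguably worse than for the arrows: most accented letters would load as mojibake with no exception, while a few (Á, for example, whose UTF-8 form ends in the undefined byte 0x81) would raise like the arrow does. Passing encoding='utf8' avoids both cases.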