I am writing a small web app (https://github.com/r-owen/base_loom_server) that runs on macOS and Raspberry Pi, but a Windows user reports an exception when Python reads a UTF-8-encoded resource file using importlib.resources.files.read_text() with no arguments. The resource is JavaScript, but I doubt it matters. This is the error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 20834: character maps to <undefined>
I believe it is complaining about the first of four arrow characters in that file: "←" = LEFTWARDS ARROW. The file has the same style of arrow in all 4 directions: U+2190 through U+2193. Those are the only non-ASCII characters in the file (other than a few unnecessary • in a comment earlier in the file, which I will replace with *).
As I understand it, with no arguments read_text decodes using UTF-8 in strict mode, which is what I want.
I would like to understand what's going on so I can work around this, and to learn about any potential problems with letters that have diacritical marks (because the app also uses read_text to load language translation files as needed).
The failure occurs here
importlib.resources.files.read_text is not the same as importlib.resources.read_text.
The Traversable object returned by importlib.resources.files has the signature read_text(encoding=None), while you seem to be referring to importlib.resources.read_text(anchor, *path_names, encoding='utf-8', errors='strict').
In my testing, encoding=None (the default) used my localized Windows encoding, Windows-1252 (this varies by locale). Use encoding='utf8' explicitly instead.
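As a minimal sketch of that advice, the helper below always passes the encoding, so the result should not depend on the machine's locale. The helper name read_utf8_resource, the throwaway package demo_resources, and the file arrows.js are all made up for illustration; they are not from your project.

```python
import importlib.resources
import sys
import tempfile
from pathlib import Path


def read_utf8_resource(package: str, name: str) -> str:
    """Read a packaged resource as UTF-8, ignoring the locale default.

    Hypothetical helper: the name and signature are illustrative only.
    """
    resource = importlib.resources.files(package).joinpath(name)
    # Passing encoding explicitly avoids the encoding=None locale fallback.
    return resource.read_text(encoding="utf-8")


# Build a throwaway package on disk so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_resources"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").write_text("", encoding="utf-8")
    # Same four arrows as in the question, stored as UTF-8 bytes.
    (pkg_dir / "arrows.js").write_bytes("// ← ↑ → ↓\n".encode("utf-8"))

    sys.path.insert(0, tmp)
    try:
        text = read_utf8_resource("demo_resources", "arrows.js")
    finally:
        sys.path.remove(tmp)
```

Routing every resource read in the app through one helper like this (including the translation files) keeps the encoding decision in a single place.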
My test consisted of a resources subdirectory containing chinese.txt encoded in UTF-8 and spanish.txt encoded in Windows-1252:
import importlib.resources
res = importlib.resources.files('resources')
zh = res.joinpath('chinese.txt')  # UTF-8 file
sp = res.joinpath('spanish.txt')  # Windows-1252 file
try:
    print(zh.read_text())  # encoding=None: falls back to the locale encoding
except UnicodeDecodeError as e:
    print(e)
print(zh.read_text(encoding='utf8'))
print(sp.read_text())  # default
Output (note 0x90 is an undefined byte in Windows-1252 encoding):
'charmap' codec can't decode byte 0x90 in position 7: character maps to <undefined>
你好吗
¿Qué pasa?
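On the question about diacritical marks: the sketch below decodes UTF-8 bytes with the wrong codec to show what to expect. It assumes the Windows user's locale encoding is cp1252, as in my test (theirs may differ), and try_decode is a hypothetical helper written only for this illustration. "←" encodes to the bytes E2 86 90, and 0x90 is unassigned in cp1252, hence the exception. The other three arrows, the bullets, and accented letters encode to bytes that cp1252 does define, so they would decode without any error into the wrong characters.

```python
def try_decode(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode it with a mismatched codec.

    Hypothetical helper for illustration; returns the mojibake string,
    or a short description of the failure if decoding raises.
    """
    raw = text.encode("utf-8")
    try:
        return raw.decode(wrong_encoding)
    except UnicodeDecodeError as exc:
        return f"error at byte {exc.start}: 0x{raw[exc.start]:02x}"


arrow = try_decode("←")    # "error at byte 2: 0x90" (0x90 is undefined in cp1252)
bullet = try_decode("•")   # "â€¢" (silently wrong, no exception)
accent = try_decode("é")   # "Ã©" (silently wrong, no exception)
```

So for the translation files the likely symptom is garbled text rather than a crash, which is harder to notice. Passing the encoding explicitly on every read_text call avoids both outcomes.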