This is driving me somewhat nutty at the moment. It is clear from my last few days of research that Unicode is a complex topic. But here is behavior that I do not know how to address.
If I read a file with non-ASCII characters from disk and write it back to a file, everything works as planned. However, when I read the same file from sys.stdin, it does not work and the non-ASCII characters are not encoded properly. The sample code is here:
# -*- coding: utf-8 -*-
import sys

with open("testinput.txt", "r") as ifile:
    lines = ifile.read()

with open("testout1.txt", "w") as ofile:
    for line in lines:
        ofile.write(line)

with open("testout2.txt", "w") as ofile:
    for line in sys.stdin:
        ofile.write(line)
The input file testinput.txt is this:
を
Sōten_Kōro
When I run the script from the command line as cat testinput.txt | python test.py, I get the following output, respectively:
testout1.txt:
を
Sōten_Kōro
testout2.txt:
???
S??ten_K??ro
Any ideas on how to address this would be of great help. Thanks, Paul.
The reason is that you took a shortcut, which should never have been taken.
You should always define an encoding: when you read the file, specify that you are reading UTF-8, or whatever the file's encoding actually is. Or make it explicit that you are reading binary data.
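For example, a minimal sketch of both explicit variants (assuming Python 3, where open() accepts an encoding argument; the file names are the ones from your question):

# Explicit text encoding: decode/encode as UTF-8 on both ends.
with open("testinput.txt", "r", encoding="utf-8") as ifile:
    text = ifile.read()
with open("testout1.txt", "w", encoding="utf-8") as ofile:
    ofile.write(text)

# Or explicit binary mode: copy raw bytes and never decode at all.
with open("testinput.txt", "rb") as ifile:
    data = ifile.read()
with open("testout1.txt", "wb") as ofile:
    ofile.write(data)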
In your case, the Python interpreter uses UTF-8 as the default encoding when reading from files, because that is the default on Linux and macOS.
But when you read from standard input, the default is defined by the locale encoding, or by the PYTHONIOENCODING environment variable.
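You can see which codecs Python actually picked with a quick check (a minimal sketch, assuming Python 3):

import sys

# Compare the encoding used for stdin with the one used for the file.
print("stdin encoding:", sys.stdin.encoding)
with open("testinput.txt", "r") as ifile:
    print("file encoding:", ifile.encoding)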
I refer you to How to change the stdin encoding on python for how to solve it; this answer just explains the cause.
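For completeness, a minimal sketch of the kind of fix covered there (assuming Python 3): re-wrap the underlying byte stream of sys.stdin with an explicit encoding, or set PYTHONIOENCODING=utf-8 in the environment before running the script.

import io
import sys

# Decode standard input explicitly as UTF-8 instead of relying on the locale.
utf8_stdin = io.TextIOWrapper(sys.stdin.buffer, encoding="utf-8")

with open("testout2.txt", "w", encoding="utf-8") as ofile:
    for line in utf8_stdin:
        ofile.write(line)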