I was reading this highly rated post on SO about Unicode.
Here is an illustration given there:
$ python
>>> import sys
>>> print sys.stdout.encoding
UTF-8
>>> print '\xe9' # (1)
é
>>> print u'\xe9' # (2)
Ã©
>>> print u'\xe9'.encode('latin-1') # (3)
é
>>>
and the explanation was given as:
(1) python outputs binary string as is, terminal receives it and tries to match its value with latin-1 character map. In latin-1, 0xe9 or 233 yields the character "é" and so that's what the terminal displays.
My question is: why does the terminal match against the latin-1 character map when the encoding is 'UTF-8'?
Also, when I tried
>>> print '\xe9'
?
>>> print u'\xe9'
é
I get a different result for the first one than what is described above. Why is there this discrepancy, and where does latin-1 come into play in this picture?
You are missing some important context; in that case the OP configured the terminal emulator (Gnome Terminal) to interpret output as Latin-1 but left the shell variables set to UTF-8. Python thus is told by the shell to use UTF-8 for Unicode output but the actual configuration of the terminal is to expect Latin-1 bytes.
The print output clearly shows that the terminal is interpreting the bytes as Latin-1, not as UTF-8.
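As a rough illustration of that mismatch, here is a small sketch. It uses Python 3 syntax (the session in the question is Python 2), and `terminal_shows` is a made-up helper that only imitates how a terminal configured for a given encoding would decode the bytes it receives; real terminals differ in how they render invalid input.

```python
# Python 3 sketch; terminal_shows() is a hypothetical stand-in for a terminal.
text = "\u00e9"  # é, U+00E9

utf8_bytes = text.encode("utf-8")      # what Python writes when told stdout is UTF-8
latin1_bytes = text.encode("latin-1")  # the single byte 0xE9


def terminal_shows(raw, terminal_encoding):
    """Decode raw bytes the way a terminal using terminal_encoding would."""
    return raw.decode(terminal_encoding, errors="replace")


# Terminal expects Latin-1:
latin1_term_raw = terminal_shows(latin1_bytes, "latin-1")  # 'é'
latin1_term_utf8 = terminal_shows(utf8_bytes, "latin-1")   # 'Ã©' (two characters)

# Terminal expects UTF-8:
utf8_term_raw = terminal_shows(latin1_bytes, "utf-8")      # U+FFFD, lone 0xE9 is invalid
utf8_term_utf8 = terminal_shows(utf8_bytes, "utf-8")       # 'é'

# ascii() keeps this script's own output independent of the real terminal encoding.
for name in ("latin1_term_raw", "latin1_term_utf8", "utf8_term_raw", "utf8_term_utf8"):
    print(name, ascii(globals()[name]))
```

The same bytes give different characters depending only on what the terminal assumes, which is why the shell variables and the terminal emulator setting need to agree.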
When a terminal is set to UTF-8, the \xe9 byte on its own is not valid UTF-8 (it is an incomplete sequence), and your terminal usually prints a question mark instead:
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print '\xe9'
?
>>> print u'\xe9'
é
>>> print u'\xe9'.encode('utf8')
é
If you decode the byte yourself, Python raises an error. If you instruct it to replace undecodable bytes rather than fail, it gives you U+FFFD REPLACEMENT CHARACTER (the � glyph) instead:
>>> '\xe9'.decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/mj/Development/venvs/stackoverflow-2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe9 in position 0: unexpected end of data
>>> '\xe9'.decode('utf8', 'replace')
u'\ufffd'
>>> print '\xe9'.decode('utf8', 'replace')
�
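For comparison, here is a minimal sketch of the three common error handlers side by side. It is written for Python 3 (where byte strings are `b'...'` literals), so it is not a copy of the Python 2 session above; the variable names are only illustrative.

```python
# Python 3 sketch comparing decode error handlers on a lone 0xE9 byte.
lone_byte = b"\xe9"

# "strict" is the default handler: invalid input raises UnicodeDecodeError.
try:
    lone_byte.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD REPLACEMENT CHARACTER for the bad byte.
replaced = lone_byte.decode("utf-8", errors="replace")

# "ignore" silently drops the bad byte, leaving an empty string here.
ignored = lone_byte.decode("utf-8", errors="ignore")

print(strict_failed, ascii(replaced), ascii(ignored))
```

Note that `'replace'` and `'ignore'` behave differently: only `'replace'` produces the � glyph. Whichever handler is used, the underlying problem is the same: a lone \xe9 byte cannot be decoded as UTF-8.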
That's because in UTF-8, \xe9 is the lead byte of a 3-byte sequence, used for the Unicode code points U+9000 through U+9FFF; printed as just a single byte, it is invalid. This works:
>>> print '\xe9\x80\x80'
退
because that's the UTF-8 encoding of the U+9000 codepoint, a CJK Ideograph glyph.
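The U+9000 to U+9FFF range can be checked with a little bit arithmetic. The following is a sketch in Python 3, assuming the standard UTF-8 layout for 3-byte sequences (`1110xxxx 10xxxxxx 10xxxxxx`); the variable names are just for illustration.

```python
# Python 3 sketch: which code points can start with the lead byte 0xE9?
lead = 0xE9  # binary 1110 1001

# A lead byte of the form 1110xxxx announces a 3-byte sequence.
is_three_byte_lead = (lead >> 4) == 0b1110

# The low 4 bits of the lead byte are the top 4 bits of the code point.
top_bits = lead & 0x0F  # 0b1001 == 9

# Two continuation bytes (10xxxxxx) contribute 6 bits each, 12 bits in total.
lowest = top_bits << 12             # all continuation payload bits 0
highest = (top_bits << 12) | 0xFFF  # all continuation payload bits 1

print(is_three_byte_lead, hex(lowest), hex(highest))  # True 0x9000 0x9fff

# Cross-check against Python's own UTF-8 encoder.
encoded_lowest = chr(lowest).encode("utf-8")    # b'\xe9\x80\x80'
encoded_highest = chr(highest).encode("utf-8")  # b'\xe9\xbf\xbf'
```

Both ends of the range encode to a sequence beginning with \xe9, and every such sequence needs two continuation bytes after it, which is why the byte cannot stand alone.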
If you want to understand the difference between encodings and Unicode, and how UTF-8 and other codecs work, I strongly recommend you read: