I get the point about unicode, encoding, and decoding. But I don't understand why the encode()
function works on a string object. I expected it to work only on a unicode type.
What is the behavior of encode()
when it's used on a string object rather than a unicode object?
In Python 2 there are two types of codecs available: those that convert between str and unicode, and those that convert from str to str. Examples of the latter are the base64 and rot13 codecs. The str.encode() method exists to support the latter:
'binary data'.encode('base64')
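For example, here is a minimal sketch of those str -> str codecs in action; it assumes a Python 2 interpreter, and the values in the comments are what I'd expect to get back:

encoded = 'binary data'.encode('base64')   # 'YmluYXJ5IGRhdGE=\n', still a str
decoded = encoded.decode('base64')         # back to 'binary data', also a str
scrambled = 'binary data'.encode('rot13')  # 'ovanel qngn'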
But now that it exists, people are also using it for the unicode -> str codecs; encoding can only go from unicode to str (and decoding the other way). To support these, Python will implicitly decode your str value to unicode first, using the ASCII codec, before finally encoding.
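A quick sketch of what that implicit decode means in practice (again assuming Python 2; the error text in the comment is abbreviated):

'hello'.encode('utf-8')        # fine: the implicit ASCII decode succeeds, result is the str 'hello'
'caf\xc3\xa9'.encode('utf-8')  # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 ...
# The second call fails before any UTF-8 encoding happens, because Python first
# tries 'caf\xc3\xa9'.decode('ascii') behind the scenes.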
Incidentally, when using a str -> str codec on a unicode object, Python first implicitly encodes to str using the same ASCII codec.
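The mirror-image sketch, under the same Python 2 assumption:

u'binary data'.encode('base64')  # works: implicitly encoded to str via ASCII first, then base64
u'caf\xe9'.encode('base64')      # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' ...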
In Python 3, this has been solved by a) removing the bytes.encode() and str.decode() methods (remember that bytes is roughly the old str, and str the new unicode), and b) moving the str -> str encodings to the codecs module only, via the codecs.encode() and codecs.decode() functions. Which codecs transform between the same type has also been clarified and documented; see the Python Specific Encodings section of the codecs documentation, and note that the 'text' encodings listed there, where available in Python 2, encode to str instead.
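A short Python 3 sketch of the codecs.encode() / codecs.decode() route; the short codec aliases such as 'base64' and 'rot13' assume Python 3.4 or later:

import codecs

codecs.encode(b'binary data', 'base64')         # b'YmluYXJ5IGRhdGE=\n'
codecs.decode(b'YmluYXJ5IGRhdGE=\n', 'base64')  # b'binary data'
codecs.encode('secret message', 'rot13')        # 'frperg zrffntr' (rot13 is a text -> text codec here)
'binary data'.encode('base64')                  # LookupError: 'base64' is not a text encoding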