python, string, unicode, encode, python-2.x

What happens when "encode()" is used on a string object in Python?


I get the point about unicode, encoding, and decoding. But I don't understand why the encode() function works on a string object. I expected it to work only on a unicode type.

What is the behavior of encode() when it's used on a string object rather than a unicode object?


Solution

  • In Python 2 there are two kinds of codecs available: those that convert between str and unicode, and those that convert from str to str. Examples of the latter are the base64 and rot13 codecs.

    The str.encode() method exists to support the latter:

    'binary data'.encode('base64')
    

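    For instance, a quick round trip with the base64 codec looks roughly like this (illustrative output; this codec appends a trailing newline):

    encoded = 'binary data'.encode('base64')   # 'YmluYXJ5IGRhdGE=\n'
    encoded.decode('base64')                   # back to 'binary data'
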
    But now that it exists, people also use it with the unicode -> str codecs such as UTF-8; those encodings can only go from unicode to str (and decode in the other direction). To make such calls work anyway, Python 2 first implicitly decodes your str value to unicode using the ASCII codec, and only then performs the encoding you asked for.
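
    A short Python 2 sketch of that implicit decode step; the byte values are just illustrative (a UTF-8 encoded 'café'):

    'cafe'.encode('utf-8')         # fine: ASCII-only bytes decode implicitly, then re-encode
    'caf\xc3\xa9'.encode('utf-8')
    # UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3:
    # ordinal not in range(128)
    # i.e. Python effectively ran 'caf\xc3\xa9'.decode('ascii').encode('utf-8')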

    Incidentally, when using a str -> str codec on a unicode object, Python first implicitly encodes to str using the same ASCII codec.
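
    A sketch of that direction, again assuming Python 2:

    u'data'.encode('base64')       # implicitly ASCII-encoded to str first, then base64: 'ZGF0YQ==\n'
    u'd\xe4ta'.encode('base64')    # u'däta' has no ASCII encoding
    # UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1:
    # ordinal not in range(128)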

    In Python 3, this confusion has been resolved by a) removing the bytes.encode() and str.decode() methods (remember that bytes is roughly the old str and str the new unicode), and b) restricting the same-type (str -> str style) codecs to the codecs module, via the codecs.encode() and codecs.decode() functions. Which codecs transform between the same type has also been clarified and documented; see the Python Specific Encodings section of the codecs documentation, and note that the 'text' encodings listed there, where available in Python 2, encode to str instead.
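
    For comparison, a rough Python 3 equivalent routes the same-type transform through the codecs module:

    import codecs

    codecs.encode(b'binary data', 'base64')         # b'YmluYXJ5IGRhdGE=\n'
    codecs.decode(b'YmluYXJ5IGRhdGE=\n', 'base64')  # b'binary data'

    'binary data'.encode('base64')
    # LookupError: 'base64' is not a text encoding; use codecs.encode() to handle arbitrary codecs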