[SOLVED] How to encode Unicode to bytes in Python 3.11 so that the original string can be retrieved?

In Python 3.11 we can encode a string like this:

string.encode('ascii', 'backslashreplace')

This works neatly for, say: hellö => hell\\xf6

However, when I pass in hellö w\\xf6rld I get hell\\xf6 w\\xf6rld (notice that the second word already contains a literal part that looks like a character escape sequence).

Or in other words the following holds:

'hellö wörld'.encode('ascii', 'backslashreplace') == 'hellö w\\xf6rld'.encode('ascii', 'backslashreplace')

This means the encoding loses information: two different strings produce the same bytes, so the original cannot be recovered.

Is there a way to make Python encode this reversibly, so that backslashes themselves are also escaped? Or is there a library that does so?

Solution

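Use the unicode_escape codec with no error handler instead of the ascii codec with the backslashreplace error handler. The ascii codec cannot represent ö, so the error handler substitutes an escape sequence for it, but it leaves backslashes that are already in the string untouched. That is where the information is lost. The unicode_escape codec also produces only ASCII bytes, but it escapes backslashes as well, so the two strings encode differently: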

>>> 'hellö wörld'.encode('unicode_escape') == 'hell\\xf6 w\\xf6rld'.encode('unicode_escape')
False
>>> 'hellö wörld'.encode('unicode_escape')
b'hell\\xf6 w\\xf6rld'
>>> 'hell\\xf6 w\\xf6rld'.encode('unicode_escape')
b'hell\\\\xf6 w\\\\xf6rld'
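
To get the original string back, decode with the same codec. The following is a minimal sketch of the round trip; the helper names to_ascii_bytes and from_ascii_bytes are hypothetical and only wrap the built-in codec:

```python
def to_ascii_bytes(text: str) -> bytes:
    """Encode text to ASCII-only bytes; backslashes and non-ASCII characters are escaped."""
    return text.encode("unicode_escape")


def from_ascii_bytes(data: bytes) -> str:
    """Reverse to_ascii_bytes (hypothetical helper, not a library function)."""
    return data.decode("unicode_escape")


# Sample inputs, including one with a literal backslash sequence.
samples = ["hellö wörld", "hellö w\\xf6rld", "back\\slash", "line\nbreak"]
for original in samples:
    encoded = to_ascii_bytes(original)
    assert encoded.isascii()                      # output is ASCII only
    assert from_ascii_bytes(encoded) == original  # nothing was lost
```

Because the backslash is escaped too, the two strings from the question no longer collide.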

If you don't need ASCII-only output, just call .encode() (the default in Python 3 is UTF-8, which handles all Unicode), then .decode() to restore the string.
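
As a small illustration of the UTF-8 route, assuming the same two strings as in the question (the variable names are only for this sketch):

```python
with_umlaut = "hellö wörld"
with_backslash = "hellö w\\xf6rld"  # contains a literal backslash

# str.encode() and bytes.decode() default to UTF-8 in Python 3.
data_umlaut = with_umlaut.encode()
data_backslash = with_backslash.encode()

# The two inputs stay distinct, and each one round-trips.
assert data_umlaut != data_backslash
assert data_umlaut.decode() == with_umlaut
assert data_backslash.decode() == with_backslash
```

Note that the resulting bytes are not restricted to ASCII: ö becomes the two bytes 0xC3 0xB6.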