
UnicodeEncodeError when formatting u'ES SIOUF_1' in Python 2


I have this code:

"'{}'".format(u'ES SIOUF_1')

When run in Python 2, I receive the following error:

Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 2: ordinal not in range(128)

The same code run in Python 3 gives:

>>> "'ES\xa0SIOUF_1'"

I don't want either of these. What I need is:

>>> "'ES SIOUF_1'"

I have read many questions about "encoding" and "decoding" characters in Python, and about how Python 2 and 3 differ in this regard.

However, I honestly don't understand them, and I'd like to solve this problem for both versions of Python if possible.

The thing I've noticed is that doing:

type(u'ES SIOUF_1')

gives:

>>> <type 'unicode'> # PYTHON 2
>>> <class 'str'> # PYTHON 3

Solution

  • You have fallen into a corner-case trap. Unicode defines U+00A0 (u'\xa0' in Python notation) as the NO-BREAK SPACE character. It prints exactly like a normal space (U+0020 or u'\x20'), but it is a distinct character and is outside the ASCII range.

    For reasons I cannot guess (maybe a copy and paste), you managed to get this no-break space into your unicode string. That explains the \xa0 in the Python 3 output (the interactive prompt shows the repr of the string, which escapes the no-break space) and the failure to convert it to ASCII in Python 2. Because the format string is a plain (byte) string in your Python 2 code, the unicode argument is implicitly encoded to ASCII, which raises the exception. So in Python 2 you need to use a unicode format string to avoid the error:

    u"'{}'".format(u'ES SIOUF_1')
    

    works in Python 2 just as it does in Python 3.
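To make the hidden character visible, here is a minimal sketch that should run on both Python 2 and Python 3. The no-break space is written as an explicit \xa0 escape, and the variable names (name, non_ascii, quoted) are only illustrative. It lists every non-ASCII character with its position, then formats the string with a unicode format string:

```python
import unicodedata

# The no-break space is written as an explicit escape so it cannot be
# confused with a normal space.
name = u'ES\xa0SIOUF_1'

# Collect (position, Unicode name) for every character outside ASCII.
non_ascii = [(i, unicodedata.name(ch))
             for i, ch in enumerate(name) if ord(ch) > 127]
print(non_ascii)  # [(2, 'NO-BREAK SPACE')]

# With a unicode format string the result stays text, so Python 2 never
# attempts the implicit ASCII encoding that raised the error.
quoted = u"'{}'".format(name)
print(repr(quoted))
```

The position reported here (2) is the same one named in the traceback.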

    How to fix?

    The correct way is to get rid of the offending u'\xa0' at the source, before the string reaches this code. If you cannot, you can explicitly replace it with a normal space:

    "'{}'".format(u'ES SIOUF_1'.replace(u'\xa0', u'\x20'))
    

    should give what you want in both Python 2 and Python 3.
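If this comes up in several places, the replacement can be wrapped in a small function. The sketch below is one possible way to do it; quote_name is a hypothetical helper name, not something from a library. It also uses a unicode format string so the result is text on both versions:

```python
def quote_name(text):
    """Hypothetical helper: swap no-break spaces for plain spaces, then quote."""
    cleaned = text.replace(u'\xa0', u' ')
    # A unicode format string avoids any implicit ASCII encoding in Python 2.
    return u"'{}'".format(cleaned)


result = quote_name(u'ES\xa0SIOUF_1')
print(result)  # 'ES SIOUF_1'
```

Note that this handles only U+00A0. Other look-alike spaces, if your data contains any, would each need their own replacement.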