pythonunicodefuture-proof

(unicode error) 'unicodeescape' codec can't decode bytes - string with '\u'


Writing my code for Python 2.6, but with Python 3 in mind, I thought it was a good idea to put

from __future__ import unicode_literals

at the top of some modules. In other words, I am asking for troubles (to avoid them in the future), but I might be missing some important knowledge here. I want to be able to pass a string representing a filepath and instantiate an object as simple as

MyObject('H:\unittests')

In Python 2.6, this works just fine, no need to use double backslashes or a raw string, even for a directory starting with '\u..', which is exactly what I want. In the __init__ method I make sure all single \ occurences are interpreted as '\\', including those before special characters as in \a, \b, \f,\n, \r, \t and \v (only \x remains a problem). Also decoding the given string into unicode using (local) encoding works as expected.

Preparing for Python 3.x, simulating my actual problem in an editor (starting with a clean console in Python 2.6), the following happens:

>>> '\u'
'\\u'
>>> r'\u'
'\\u'

(OK until here: '\u' is encoded by the console using the local encoding)

>>> from __future__ import unicode_literals
>>> '\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence

In other words, the (unicode) string is not interpreted as unicode at all, nor does it get decoded automatically with the local encoding. Even so for a raw string:

>>> r'\u'
SyntaxError: (unicode error) 'rawunicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX

same for u'\u':

>>> u'\u'
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: end of string in escape sequence

Also, I would expect isinstance(str(''), unicode) to return True (which it does not), because importing unicode_literals should make all string-types unicode. (edit:) Because in Python 3, all strings are sequences of Unicode characters, I would expect str('')) to return such a unicode-string, and type(str('')) to be both <type 'unicode'>, and <type 'str'> (because all strings are unicode) but also realise that <type 'unicode'> is not <type 'str'>. Confusion all around...

Questions

edit: In Python 3, <type 'str'> is a Unicode object and <type 'unicode'> simply does not exist. In my case I want to write code for Python 2(.6) that will work in Python 3. But when I import unicode_literals, I cannot check if a string is of <type 'unicode'> because:

My modules use to be encoded in 'utf-8' by a # coding: UTF-8 comment at the top, while my locale.getdefaultlocale()[1] returns 'cp1252'. So if I call MyObject('çça') from my console, it is encoded as 'cp1252' in Python 2, and in 'utf-8' when calling MyObject('çça') from the module. In Python 3, it will not be encoded, but a unicode literal.

edit:

I gave up hope about being allowed to avoid using '\' before a u (or x for that matter). Also I understand the limitations of importing unicode_literals. However, the many possible combinations of passing a string from a module to the console and vica versa with each different encoding, and on top of that importing unicode_literals or not and Python 2 vs Python 3, made me want to create an overview by actual testing. Hence the table below. enter image description here

In other words, type(str('')) does not return <type 'str'> in Python 3, but <class 'str'>, and all of Python 2 problems seem to be avoided.


Solution

  • AFAIK, all that from __future__ import unicode_literals does is to make all string literals of unicode type, instead of string type. That is:

    >>> type('')
    <type 'str'>
    >>> from __future__ import unicode_literals
    >>> type('')
    <type 'unicode'>
    

    But str and unicode are still different types, and they behave just like before.

    >>> type(str(''))
    <type 'str'>
    

    Always, is of str type.

    About your r'\u' issue, it is by design, as it is equivalent to ru'\u' without unicode_literals. From the docs:

    When an 'r' or 'R' prefix is used in conjunction with a 'u' or 'U' prefix, then the \uXXXX and \UXXXXXXXX escape sequences are processed while all other backslashes are left in the string.

    Probably from the way the lexical analyzer worked in the python2 series. In python3 it works as you (and I) would expect.

    You can type the backslash twice, and then the \u will not be interpreted, but you'll get two backslashes!

    Backslashes can be escaped with a preceding backslash; however, both remain in the string

    >>> ur'\\u'
    u'\\\\u'
    

    So IMHO, you have two simple options: