My code needs to be compatible with both Python 2.x and 3.x. My function receives two strings as input, and I need to do some manipulation on them:
    if len(str1) > 10:
        str1 = str1[:10] + '...'
    if six.PY3:
        return ' : '.join((str1, str2))
On Python 2.x, the join above raises an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
What is a clean way of handling such cases across all 2.x and 3.x versions? Since both strings are inputs to my code, I need to ensure that they are joined properly even if either of them contains UTF-8 encoded non-ASCII characters.
Disclaimer: I am very new to Python.
In Python 3, you're usually, hopefully, pretty much exclusively dealing with `str`. That's the data type for strings. It expresses characters. Those characters aren't in any particular encoding; when manipulating them, you do not need to understand encodings. `str1[:10]` means "the first 10 characters", whether they are "abcdefghij" or "文字化けは楽しいんだ".
When encoded to actual bytes, the type is `bytes`. You do not want to be dealing with `bytes` when manipulating text.
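To make the difference concrete, here is a small sketch that slices the same text once as characters and once as UTF-8 bytes. The variable names are only illustrative, and UTF-8 is an assumption about how the text would be encoded.

```python
# -*- coding: utf-8 -*-
text = u"文字化けは楽しいんだ"   # text: a sequence of characters
data = text.encode("utf-8")     # the same text encoded as UTF-8 bytes

char_count = len(text)   # counts characters
byte_count = len(data)   # counts bytes; each of these characters takes 3 bytes in UTF-8

first_three = text[:3]   # slicing text always yields whole characters
broken = data[:4]        # slicing bytes here cuts through the second character

try:
    broken.decode("utf-8")
    split_a_character = False
except UnicodeDecodeError:
    # The truncated byte sequence is no longer valid UTF-8.
    split_a_character = True
```

Slicing the text keeps characters intact, while slicing the bytes can leave a fragment that no longer decodes.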
In Python 2, because reasons, what is `str` in Python 3 was `unicode` in Python 2. What is `bytes` in Python 3 was `str` in Python 2.
String literals `''` in Python 3 are `str`s; in Python 2 they're `str` (expressing bytes). The right string literal to express `unicode` in Python 2 is `u''`. `u''` still works in Python 3 (3.3 and later) and maps to Python 3 `str`, so it expresses a character-based type in both versions.
Python 3 | Python 2 | expresses |
---|---|---|
`str`, literal: `''`, `u''` | `unicode`, literal: `u''` | characters |
`bytes`, literal: `b''` | `str`, literal: `''`, `b''` | bytes |
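The table can be checked with a few lines that run on either version. This is only a sketch; `PY3` is a local helper flag defined here, not something taken from the question.

```python
import sys

PY3 = sys.version_info[0] >= 3   # local helper flag for this sketch

text_literal = u"abc"    # character type in both versions
bytes_literal = b"abc"   # byte type in both versions
plain_literal = "abc"    # characters on Python 3, bytes on Python 2

# On Python 2, `bytes` is simply another name for `str`.
assert isinstance(bytes_literal, bytes)
assert not isinstance(text_literal, bytes)

if PY3:
    assert type(text_literal) is str
    assert type(plain_literal) is str
else:
    assert type(text_literal).__name__ == "unicode"
    assert type(plain_literal) is bytes
```

Only the unprefixed literal changes meaning between the two versions, which is why the prefixed forms are the safer choice in code that has to run on both.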
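Concatenating `str` to `bytes` in Python 3 explicitly fails with a `TypeError`, but in Python 2 mixing `unicode` and `str` triggers an implicit conversion: the `str` is decoded with the ASCII codec, which happens to work for pure-ASCII data and raises a `UnicodeDecodeError` (like the one in the question) as soon as a non-ASCII byte shows up. You do not want to mix `unicode` and `str` in Python 2. Mostly you want to make sure everything is `unicode`.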
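The following sketch mixes a text value with a byte string and records which exception comes out. The helper `error_name` and the sample values are made up for illustration; the bytes `\xc2\xa0` are the UTF-8 encoding of a non-breaking space, chosen because they start with the `0xc2` byte from the error message in the question.

```python
text = u"abc"
data = b"\xc2\xa0"   # UTF-8 bytes of a non-breaking space (starts with 0xc2)


def error_name(operation):
    """Run `operation` and return the name of the exception it raises, or None."""
    try:
        operation()
    except (TypeError, UnicodeDecodeError) as exc:
        return type(exc).__name__
    return None


# Python 3 refuses to mix the two types and raises TypeError.
# Python 2 tries an implicit ASCII decode of `data`, which fails with
# UnicodeDecodeError because 0xc2 is not an ASCII byte.
concat_error = error_name(lambda: text + data)
join_error = error_name(lambda: u" : ".join((text, data)))
```

Either way the operation fails; the difference is that Python 2 would have silently succeeded had `data` contained only ASCII bytes, which is what makes the bug easy to miss.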
So what you want, to have it work in both versions, is:
    if len(str1) > 10:
        str1 = str1[:10] + u'...'
    return u' : '.join((str1, str2))
And then you want to make sure `str1` and `str2` are `str` in Py3 and `unicode` in Py2. How to do that exactly depends on where they're coming from.
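One possible way to do that, sketched under the assumption that any byte strings reaching the function are UTF-8 encoded, is to decode at the boundary and work with text from there on. The names `to_text` and `shorten_and_join` are hypothetical, not part of the question's code.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as text, decoding it first if it is a byte string.

    Assumes byte strings are encoded as `encoding` (UTF-8 by default).
    `bytes` is an alias of `str` on Python 2, so the check works on both.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def shorten_and_join(str1, str2):
    # Normalise both inputs to text before doing any manipulation.
    str1 = to_text(str1)
    str2 = to_text(str2)
    if len(str1) > 10:
        str1 = str1[:10] + u"..."
    return u" : ".join((str1, str2))


# Mixed input: UTF-8 bytes for the first argument, text for the second.
result = shorten_and_join(b"\xc2\xa1Hola, mundo!", u"saludo")
```

If the real encoding of the input is not UTF-8, the `encoding` argument has to change accordingly. If `six` is already a dependency, as the question suggests, recent releases provide `six.ensure_text`, which serves the same purpose.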