[SOLVED] How to manipulate/join multiple strings containing UTF-8 characters

My code needs to be compatible with both Python 2.x and 3.x. My function receives two strings as input, and I need to do some manipulation on them:

if len(str1) > 10:
    str1 = str1[:10] + '...'
if six.PY3:
    return ' : '.join((str1, str2))

On Python 2.x, the join above raises an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

What is a clean way of handling such cases across all 2.x and 3.x versions? Since both strings are inputs to my code, I need to ensure that they are joined properly even if either of them contains UTF-8 characters.

Disclaimer: I am very new to Python.

Solution

In Python 3, you're usually, hopefully, pretty much exclusively dealing with str. That's the data type for strings. It expresses characters. Those characters aren't in any particular encoding; when manipulating them, you do not need to understand encodings. str1[:10] means "the first 10 characters", whether they are "abcdefghij" or "文字化けは楽しいんだ".

When encoded to actual bytes, the type is bytes. You do not want to be dealing with bytes when manipulating text.
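To make the difference between characters and bytes concrete, here is a small sketch that should run unchanged on Python 2 and 3. The variable names are made up for illustration, and UTF-8 is only an assumed encoding:

```python
# -*- coding: utf-8 -*-
# Hypothetical sample data: the same text as characters and as UTF-8 bytes.
text = u"文字化けは楽しいんだ"         # 10 characters
encoded = text.encode("utf-8")        # 30 bytes, 3 per character here

# Slicing the character type counts characters.
first_three_chars = text[:3]          # u"文字化"

# Slicing the byte type counts bytes: this is only the first character.
first_three_bytes = encoded[:3]
```

Slicing the encoded form at an arbitrary position can also cut a character in half, which is one more reason to manipulate text only in its character form.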

In Python 2, because reasons, what is str in Python 3 was unicode in Python 2. What is bytes in Python 3 was str in Python 2.

String literals '' are str in both versions, which means characters in Python 3 but bytes in Python 2. The right string literal to express unicode in Python 2 is u''. u'' still works in Python 3 (from 3.3 on) and maps to Python 3 str, so u'' literals express a character-based type in both versions.

Python 3	Python 2	expresses
`str`, literal: `''`, `u''`	`unicode`, literal: `u''`	characters
`bytes`, literal: `b''`	`str`, literal: `''`, `b''`	bytes

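Concatenating str and bytes in Python 3 fails explicitly with a TypeError. In Python 2, mixing unicode and str instead triggers an implicit conversion: the str is decoded with the ascii codec, which works silently for pure-ASCII bytes and raises the kind of UnicodeDecodeError shown in the question as soon as a non-ASCII byte such as 0xc2 shows up. You do not want to mix unicode and str in Python 2. Mostly you want to make sure everything is unicode.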
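The following sketch mixes the two types on purpose and records which exception comes back. The sample values and the `outcome` variable are made up for illustration:

```python
import sys

# Hypothetical inputs: one piece of text, one piece of undecoded UTF-8 bytes.
text = u"copyright"
raw = u"\u00a9 someone".encode("utf-8")   # starts with the byte 0xc2

try:
    u" : ".join((text, raw))
    outcome = "joined"
except TypeError:
    # Python 3 refuses to mix str and bytes.
    outcome = "TypeError"
except UnicodeDecodeError:
    # Python 2 tries to decode the bytes as ASCII and fails on 0xc2.
    outcome = "UnicodeDecodeError"

# Which branch is expected on the running interpreter.
expected = "TypeError" if sys.version_info[0] >= 3 else "UnicodeDecodeError"
```

If `raw` contained only ASCII bytes, Python 2 would join it without complaint, which is why this kind of bug tends to show up only once real-world input arrives.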

So to have it work in both versions, what you want is:

if len(str1) > 10:
    str1 = str1[:10] + u'...'
return u' : '.join((str1, str2))

And then you want to make sure str1 and str2 are str in Py3 and unicode in Py2. How to do that exactly depends on where they're coming from.
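As one possible way to do that at the boundary of your function, here is a minimal sketch. It assumes that any byte input is UTF-8 encoded, which may not hold for your data; `to_text` and `shorten_and_join` are hypothetical names, not part of any library:

```python
# -*- coding: utf-8 -*-

def to_text(value, encoding="utf-8"):
    """Return value as text (str on Py3, unicode on Py2).

    Assumption: byte input is encoded as UTF-8 unless told otherwise.
    """
    # `bytes` is an alias of `str` on Python 2.6+, so this check
    # catches undecoded input on both versions.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def shorten_and_join(str1, str2, limit=10):
    # Normalize both inputs to text before touching them.
    str1 = to_text(str1)
    str2 = to_text(str2)
    if len(str1) > limit:
        str1 = str1[:limit] + u"..."
    return u" : ".join((str1, str2))


# Usage: bytes and text can both be passed in; text always comes out.
result = shorten_and_join(u"文字化けは楽しいんだよ".encode("utf-8"), u"ok")
```

Since you are already using six, its `ensure_text` function (available in recent six releases) does much the same job as the hand-written `to_text` here. Decoding as early as possible, ideally right where the data enters your program, keeps the rest of the code free of encoding concerns.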