Given a function like:
import six

def convert_to_unicode(text):
    """Converts `text` to Unicode (if it's not already), assuming utf-8 input."""
    if six.PY3:
        if isinstance(text, str):
            return text
        elif isinstance(text, bytes):
            return text.decode("utf-8", "ignore")
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    elif six.PY2:
        if isinstance(text, str):
            return text.decode("utf-8", "ignore")
        elif isinstance(text, unicode):
            return text
        else:
            raise ValueError("Unsupported string type: %s" % (type(text)))
    else:
        raise ValueError("Not running on Python2 or Python 3?")
Since six handles the Python 2 and Python 3 compatibility, would the above convert_to_unicode(text) function be equivalent to just six.text_type(text)? I.e.
def convert_to_unicode(text):
    return six.text_type(text)
Are there cases that the original convert_to_unicode captures but six.text_type can't?
Since six.text_type is just a reference to the str or unicode type, an equivalent function would be this:
def convert_to_unicode(text):
    return six.text_type(text, encoding='utf8', errors='ignore')
But it doesn't behave the same in the corner cases (the one-argument six.text_type(text) from the question, e.g., will just happily convert an integer), so you'd have to put some checks there first.
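To make the difference concrete, here is a minimal sketch for Python 3 only. It assumes that six.text_type is an alias for str there (which is how six defines it), so plain str stands in for it and the snippet runs without six installed. The variable names are just for illustration.

```python
# Python 3 only: six.text_type is str there, so str stands in for it.
text_type = str

raw = b"caf\xc3\xa9"  # UTF-8 encoded bytes for "café"

# One-argument form: nothing is decoded, you get the repr of the bytes object.
as_repr = text_type(raw)

# Decoding form: the bytes are interpreted as UTF-8.
decoded = text_type(raw, encoding="utf8", errors="ignore")

# The one-argument form also accepts non-string objects instead of rejecting them.
from_int = text_type(5)

print(repr(as_repr))
print(repr(decoded))
print(repr(from_int))
```

So on Python 3 the one-argument version from the question would turn bytes into a string starting with b', which is probably not what you want.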
Also, I don't see why you would want to have errors='ignore'. You say you assume UTF-8. But if this assumption is violated, you are silently deleting data. I would strongly suggest using errors='strict'.
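A small sketch of what "silently deleting data" looks like in practice. The sample bytes are an assumption chosen for illustration: "café" encoded as Latin-1, which is not valid UTF-8.

```python
data = b"caf\xe9"  # Latin-1 bytes for "café"; a lone \xe9 is not valid UTF-8

# errors="ignore": the offending byte is dropped without any warning.
lossy = data.decode("utf-8", "ignore")

# errors="strict": the same input raises, so the bad assumption is noticed.
try:
    data.decode("utf-8", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(lossy), strict_failed)
```

With 'ignore' the caller gets back a shorter string and has no way to tell that anything was lost.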
I just realised this doesn't work if text is already what you want.
Also, it happily raises a TypeError for any non-string input.
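Both failure modes can be checked with a short sketch. It is Python 3 only and again uses str as a stand-in for six.text_type; decode_only and raises_type_error are hypothetical helper names, not part of six.

```python
text_type = str  # stand-in for six.text_type on Python 3

def decode_only(text):
    # The decoding form without any type check in front of it.
    return text_type(text, encoding="utf8", errors="ignore")

def raises_type_error(value):
    # True if decode_only rejects the value with a TypeError.
    try:
        decode_only(value)
    except TypeError:
        return True
    return False

print(decode_only(b"abc"))       # bytes are decoded as expected
print(raises_type_error("abc"))  # text that is already Unicode is rejected
print(raises_type_error(5))      # so is any non-string input
```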
So how about this:
def convert_to_unicode(text):
    if isinstance(text, six.text_type):
        return text
    return six.text_type(text, encoding='utf8', errors='ignore')
The only corner case uncovered here is that of the Python version being neither 2 nor 3.
And I still think you should use errors='strict'.
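For reference, a sketch of the same function with strict decoding. Two things here are my additions and not part of the code above: the errors parameter (so callers can still opt into 'ignore'), and the fallback to str when six is not installed, which assumes Python 3.

```python
try:
    import six
    text_type = six.text_type
except ImportError:
    # Assumption: without six we are on Python 3, where six.text_type is str.
    text_type = str

def convert_to_unicode(text, errors="strict"):
    """Return text unchanged if it is already Unicode, else decode it as UTF-8."""
    if isinstance(text, text_type):
        return text
    # Raises TypeError for non-string input and, with errors="strict",
    # UnicodeDecodeError for bytes that are not valid UTF-8.
    return text_type(text, encoding="utf8", errors=errors)

print(convert_to_unicode("abc"))
print(convert_to_unicode(b"caf\xc3\xa9"))
```

Note that unsupported types raise a TypeError here, whereas the original function raised a ValueError. If callers depend on ValueError, you'd have to catch and re-raise.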