Does anyone know of a Python library that allows you to convert a UTF-8 string to ISO-8859-1 encoding in a smart way?
By smart, I mean replacing characters like "–" with "-". And for the many characters for which no equivalent can reasonably be found, replacing with "?" (like encode('iso-8859-1', errors='replace') does).
Since the first 256 Unicode code points coincide with ISO-8859-1, encoding to ISO-8859-1 handles every character from 0 to 255 without error; unidecode can then deal with the characters that do raise encoding errors.
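For instance, a quick Python 3 check of that property (just a sanity check, not part of the solution):

# Every code point 0-255 encodes to the identical ISO-8859-1 byte.
s = ''.join(chr(i) for i in range(256))
assert s.encode('iso-8859-1') == bytes(range(256))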
The following works on Python 2 and 3 (on Python 2, the builtins import is provided by the future package):
from builtins import str
import codecs
import unidecode

def unidecode_fallback(e):
    # e.object[e.start:e.end] is the run of characters that failed to
    # encode; transliterate it to ASCII, or fall back to '?' when
    # unidecode has no transliteration (it returns an empty string).
    part = e.object[e.start:e.end]
    replacement = str(unidecode.unidecode(part) or '?')
    return (replacement, e.start + len(part))

codecs.register_error('unidecode_fallback', unidecode_fallback)

s = u'abcdé–fgh💔ijkl'.encode('iso-8859-1', errors='unidecode_fallback')
print(s.decode('iso-8859-1'))
Result:
abcdé-fgh?ijkl
This, however, converts non-ISO-8859-1 characters to an ASCII equivalent, while sometimes a non-ASCII ISO-8859-1 equivalent would be better.
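If that matters, one option (a sketch; the mapping table, its entries, and the handler name are my own illustration, not anything shipped with unidecode) is to consult a small hand-picked table of ISO-8859-1 replacements before falling back to unidecode:

import codecs
import unidecode

# Hand-picked replacements that exist in ISO-8859-1 but not in ASCII
# (illustrative entries only; extend as needed).
LATIN1_PREFERRED = {
    u'\u03bc': u'\xb5',  # Greek small letter mu -> micro sign µ
    u'\u2022': u'\xb7',  # bullet -> middle dot ·
}

def latin1_fallback(e):
    # Prefer the hand-picked ISO-8859-1 replacement, then unidecode's
    # ASCII transliteration, then '?'.
    part = e.object[e.start:e.end]
    replacement = u''.join(
        LATIN1_PREFERRED.get(ch) or unidecode.unidecode(ch) or u'?'
        for ch in part)
    return (replacement, e.end)

codecs.register_error('latin1_fallback', latin1_fallback)

s = u'5 \u03bcm \u2022 done'.encode('iso-8859-1', errors='latin1_fallback')
print(s.decode('iso-8859-1'))  # 5 µm · done

This works because a string returned by an encode error handler is itself encoded with the same codec, and µ and · are valid ISO-8859-1 characters; anything not in the table still degrades to ASCII as before.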