Tags: python, encoding, utf-8, iso-8859-1

UTF-8 to ISO-8859-1 encoding: replace special characters with closest equivalent


Does anyone know of a Python library that allows you to convert a UTF-8 string to ISO-8859-1 encoding in a smart way?

By smart, I mean replacing characters like "–" with "-". And for the many characters for which no reasonable equivalent exists, replacing them with "?" (as encode('iso-8859-1', errors='replace') does).
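For reference, the stock 'replace' error handler flattens every unencodable character to "?", losing the dash as well (sample string chosen here for illustration):

    s = u'abcdé–fgh💔ijkl'.encode('iso-8859-1', errors='replace')
    print(s.decode('iso-8859-1'))  # abcdé?fgh?ijkl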


Solution

  • Since the first 256 code points of Unicode coincide with ISO-8859-1, encoding to ISO-8859-1 handles characters 0 to 255 without errors. For the characters that would raise encoding errors, a custom error handler based on unidecode can supply replacements.

    The following works on Python 2 and 3:

    import codecs
    from builtins import str  # from the "future" package, for Python 2 compatibility
    import unidecode
    
    def unidecode_fallback(e):
        # Slice out the substring that could not be encoded.
        part = e.object[e.start:e.end]
        # Ask unidecode for an ASCII approximation; it returns an empty
        # string when it has none, in which case we use '?'.
        replacement = str(unidecode.unidecode(part) or '?')
        # Resume encoding just past the offending substring.
        return replacement, e.start + len(part)
    
    codecs.register_error('unidecode_fallback', unidecode_fallback)
    
    s = u'abcdé–fgh💔ijkl'.encode('iso-8859-1', errors='unidecode_fallback')
    print(s.decode('iso-8859-1'))
    

    Result:

    abcdé-fgh?ijkl
    

    However, this converts non-ISO-8859-1 characters into ASCII equivalents, while sometimes a non-ASCII ISO-8859-1 equivalent would be preferable; see the sketch below.
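    A possible refinement (a sketch, not part of the original answer; the handler name latin1_fallback and the sample string are illustrative): try Unicode normalization before dropping to ASCII, so that characters such as U+212B (ANGSTROM SIGN) become their ISO-8859-1 counterpart "Å" (U+00C5) rather than a bare "A":
    
    import codecs
    import unicodedata
    import unidecode
    
    def latin1_fallback(e):
        part = e.object[e.start:e.end]
        # NFKC normalization maps some characters onto ISO-8859-1
        # code points, e.g. U+212B (ANGSTROM SIGN) onto U+00C5 ("Å").
        normalized = unicodedata.normalize('NFKC', part)
        try:
            normalized.encode('iso-8859-1')
            return normalized, e.end
        except UnicodeEncodeError:
            # No ISO-8859-1 equivalent; fall back to ASCII, or '?'.
            return unidecode.unidecode(part) or '?', e.end
    
    codecs.register_error('latin1_fallback', latin1_fallback)
    
    s = u'abcd\u212bef–gh'.encode('iso-8859-1', errors='latin1_fallback')
    print(s.decode('iso-8859-1'))  # abcdÅef-gh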