Tags: python, encoding, utf-8, iso-8859-1

UTF-8 to ISO-8859-1 encoding: replace special characters with closest equivalent


Does anyone know of a Python library that allows you to convert a UTF-8 string to ISO-8859-1 encoding in a smart way?

By smart, I mean replacing characters like "–" with "-". And for the many characters for which no reasonable equivalent exists, replacing them with "?" (as encode('iso-8859-1', errors='replace') does).
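For reference, the stock 'replace' error handler flattens every unencodable character to "?", losing the dash as well (sample string chosen here for illustration):

    s = u'abcdé–fgh💔ijkl'.encode('iso-8859-1', errors='replace')
    print(s.decode('iso-8859-1'))  # abcdé?fgh?ijkl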


Solution

  • Since the first 256 code points of Unicode coincide with ISO-8859-1, encoding to ISO-8859-1 handles characters 0 to 255 without errors. For the characters that would raise encoding errors, a custom error handler based on unidecode can supply replacements.

    The following works on Python 2 and 3:

    import codecs
    from builtins import str  # from the "future" package, for Python 2 compatibility
    import unidecode
    
    def unidecode_fallback(e):
        # Slice out the substring that could not be encoded.
        part = e.object[e.start:e.end]
        # Ask unidecode for an ASCII approximation; it returns an empty
        # string when it has none, in which case we use '?'.
        replacement = str(unidecode.unidecode(part) or '?')
        # Resume encoding just past the offending substring.
        return replacement, e.start + len(part)
    
    codecs.register_error('unidecode_fallback', unidecode_fallback)
    
    s = u'abcdé–fgh💔ijkl'.encode('iso-8859-1', errors='unidecode_fallback')
    print(s.decode('iso-8859-1'))
    

    Result:

    abcdé-fgh?ijkl
    

    However, this converts non-ISO-8859-1 characters into ASCII equivalents, while sometimes a non-ASCII ISO-8859-1 equivalent would be preferable; see the sketch below.
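    A possible refinement (a sketch, not part of the original answer; the handler name latin1_fallback and the sample string are illustrative): try Unicode normalization before dropping to ASCII, so that characters such as U+212B (ANGSTROM SIGN) become their ISO-8859-1 counterpart "Å" (U+00C5) rather than a bare "A":
    
    import codecs
    import unicodedata
    import unidecode
    
    def latin1_fallback(e):
        part = e.object[e.start:e.end]
        # NFKC normalization maps some characters onto ISO-8859-1
        # code points, e.g. U+212B (ANGSTROM SIGN) onto U+00C5 ("Å").
        normalized = unicodedata.normalize('NFKC', part)
        try:
            normalized.encode('iso-8859-1')
            return normalized, e.end
        except UnicodeEncodeError:
            # No ISO-8859-1 equivalent; fall back to ASCII, or '?'.
            return unidecode.unidecode(part) or '?', e.end
    
    codecs.register_error('latin1_fallback', latin1_fallback)
    
    s = u'abcd\u212bef–gh'.encode('iso-8859-1', errors='latin1_fallback')
    print(s.decode('iso-8859-1'))  # abcdÅef-gh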