python-3.x utf-8 iso-8859-1 iconv transliteration

Use iconv or python3 to recode utf-8 to Latin-1 (ISO-8859-1) preserving accented characters


By most accounts, one ought to be able to convert a UTF-8 file to Latin-1 (ISO-8859-1) with a trivial invocation of iconv such as:

 iconv -c -f  utf-8 -t ISO-8859-1//TRANSLIT

However, this fails to deal with accented characters properly. Consider for example:

$ echo $LC_ALL
C
$ cat Gonzalez.txt 
González, M.
$ file Gonzalez.txt
Gonzalez.txt: UTF-8 Unicode text
$ iconv -c -f  utf-8 -t ISO-8859-1//TRANSLIT < Gonzalez.txt > out
$ file out
out: ASCII text
$ cat out
Gonzalez, M.

I've tried several variations of the above, but none handles the accented "a" properly, even though Latin-1 does have an accented "a".

Indeed, uconv does handle the situation properly:

$ uconv -x Any-Accents -f utf-8 -t l1 < Gonzalez.txt > out
$ file out
out: ISO-8859 text

Opening the file in Emacs or Sublime shows the accented "a" properly. The same holds when using -x nfc.

Unfortunately, my target environment does not permit a solution using "uconv", so I am looking for a simple solution using either iconv or Python3.

python3 attempts

My attempts using python3 so far have not been successful. For example, the following:

import sys
import fileinput  # allows file to be specified or else reads from STDIN

for line in fileinput.input():
    l=line.encode("latin-1","replace") 
    sys.stdout.buffer.write(l) 

produces:

Gonza?lez, M.

(That's a literal "?".)

I've tried various other Python3 possibilities, so far without success.

Please note that I've reviewed numerous SO questions on this topic, but the answers using iconv or Python3 do not handle Gonzalez.txt properly.


Solution

  • There are two ways to represent an "a" with an acute accent in Unicode.

    One is to use a single precomposed character (U+00E1, LATIN SMALL LETTER A WITH ACUTE), as illustrated here with Python's built-in ascii function:

    >>> ascii('á')
    "'\\xe1'"
    

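    But you can also use an unaccented letter a followed by a combining accent (U+0301, COMBINING ACUTE ACCENT), written here with an explicit escape so the difference is visible:

    >>> ascii('a\u0301')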
    "'a\\u0301'"
    

    Depending on the application displaying them, the two variants may look indistinguishable (in my terminal, the latter looks a bit odd, with the accent being too large).

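    Now, Latin-1 has an accented letter a, but no combining accents, which is why the acute becomes a question mark when encoding with errors="replace". Your file evidently uses the second variant: the "?" in your Python output sits right after the "a", exactly where the combining accent was.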

    Fortunately, you can convert between the two variants automatically. Without going into details (there are many), Unicode defines normalization forms for this purpose; the two relevant here are the composed form, NFC, and the decomposed form, NFD. In Python, you can use the standard-library module unicodedata:

    >>> import unicodedata as ud
    >>> ascii(ud.normalize('NFD', 'á'))
    "'a\\u0301'"
    >>> ascii(ud.normalize('NFC', 'á'))
    "'\\xe1'"
    

    In your specific case, you can convert the input strings to NFC before encoding, which lets more of the text map onto Latin-1 characters:

    >>> n = 'Gonza\u0301lez, M.'
    >>> print(n)
    González, M.
    >>> n.encode('latin1', errors='replace')
    b'Gonza?lez, M.'
    >>> ud.normalize('NFC', n).encode('latin1', errors='replace')
    b'Gonz\xe1lez, M.'
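
Putting the pieces together, here is a minimal sketch of a recoding helper built on the NFC step shown in the session. The names to_latin1 and recode_stream are made up for illustration, and the sketch assumes the whole input fits in memory. It decodes the bytes as UTF-8 explicitly instead of relying on the locale's default encoding, which seems safer given that the question runs with LC_ALL=C.

```python
import io
import unicodedata


def to_latin1(text, errors="replace"):
    """Compose combining sequences (NFC), then encode as ISO-8859-1."""
    return unicodedata.normalize("NFC", text).encode("latin-1", errors=errors)


def recode_stream(src, dst):
    """Read UTF-8 bytes from src and write Latin-1 bytes to dst."""
    # Decode explicitly rather than trusting the locale's default encoding.
    dst.write(to_latin1(src.read().decode("utf-8")))


# Demonstration on in-memory streams; the input holds "a" + U+0301.
src = io.BytesIO("Gonza\u0301lez, M.\n".encode("utf-8"))
dst = io.BytesIO()
recode_stream(src, dst)
print(dst.getvalue())  # b'Gonz\xe1lez, M.\n'
```

To use this as a command-line filter, it should be enough to import sys and call recode_stream(sys.stdin.buffer, sys.stdout.buffer) in place of the demonstration lines. Characters that have no Latin-1 equivalent even after NFC still come out as "?"; passing errors="strict" to to_latin1 raises UnicodeEncodeError instead, which may be preferable if silent data loss is a concern.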