By most accounts, one ought to be able to change the encoding of a UTF-8 file to a Latin-1 (ISO-8859-1) encoding by a trivial invocation of iconv such as:
iconv -c -f utf-8 -t ISO-8859-1//TRANSLIT
However, this fails to deal with accented characters properly. Consider for example:
$ echo $LC_ALL
C
$ cat Gonzalez.txt
González, M.
$ file Gonzalez.txt
Gonzalez.txt: UTF-8 Unicode text
$ iconv -c -f utf-8 -t ISO-8859-1//TRANSLIT < Gonzalez.txt > out
$ file out
out: ASCII text
$ cat out
Gonzalez, M.
I've tried several variations of the above, but none handles the accented "a" properly; the point is that Latin-1 does have an accented "a", so a lossless conversion should be possible.
Indeed, uconv does handle the situation properly:
$ uconv -x Any-Accents -f utf-8 -t l1 < Gonzalez.txt > out
$ file out
out: ISO-8859 text
Opening the file in emacs or Sublime shows the accented "a" properly. The same happens using -x nfc.
Unfortunately, my target environment does not permit a solution using "uconv", so I am looking for a simple solution using either iconv or Python3.
My attempts using python3 so far have not been successful. For example, the following:
import sys
import fileinput  # allows file to be specified or else reads from STDIN

for line in fileinput.input():
    l = line.encode("latin-1", "replace")
    sys.stdout.buffer.write(l)
produces:
Gonza?lez, M.
(That's a literal "?".)
I've tried various other Python3 possibilities, so far without success.
Please note that I've reviewed numerous SO questions on this topic, but the answers using iconv or Python3 do not handle Gonzalez.txt properly.
There are two ways to encode A WITH ACUTE ACCENT in Unicode.
One is to use a single precomposed character, as illustrated here with Python's built-in ascii function:
>>> ascii('á')
"'\\xe1'"
But you can also use a combining accent following an unaccented letter a:
>>> ascii('á')
"'a\\u0301'"
Depending on the displaying application, the two variants may look indistinguishable (in my terminal, the latter looks a bit odd, with the accent rendered too large).
Now, Latin-1 has an accented letter a, but no combining accents, which is why the acute becomes a question mark when encoding with errors="replace".
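To make the failure mode concrete, here is a quick illustration (the variable names are just for the example): it is the combining mark, not the letter, that cannot be encoded.

```python
decomposed = 'a\u0301'  # letter "a" followed by COMBINING ACUTE ACCENT
# The combining acute (U+0301) has no Latin-1 code point, so it is the
# character that gets replaced, leaving a bare "a" followed by "?".
replaced = decomposed.encode('latin-1', errors='replace')
ignored = decomposed.encode('latin-1', errors='ignore')
```

With errors='ignore' the accent is silently dropped instead, which is arguably worse.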
Fortunately, you can automatically switch between the two variants.
Without going into details (there are many details here), Unicode defines two canonical normalization forms, called composed and decomposed, abbreviated NFC and NFD, respectively.
In Python, you can use the standard-library module unicodedata:
>>> import unicodedata as ud
>>> ascii(ud.normalize('NFD', 'á'))
"'a\\u0301'"
>>> ascii(ud.normalize('NFC', 'á'))
"'\\xe1'"
In your specific case, you can convert the input strings to NFC form, which will increase coverage of Latin-1 characters:
>>> n = 'Gonza\u0301lez, M.'
>>> print(n)
González, M.
>>> n.encode('latin1', errors='replace')
b'Gonza?lez, M.'
>>> ud.normalize('NFC', n).encode('latin1', errors='replace')
b'Gonz\xe1lez, M.'
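So the fix for the script in the question is a one-line normalization before encoding. A minimal sketch (the helper name to_latin1 is mine, not from any library):

```python
import unicodedata as ud

def to_latin1(text: str) -> bytes:
    # Compose decomposed sequences (e.g. 'a' + U+0301 -> U+00E1) so that
    # Latin-1 can represent the accented letters, then encode.
    return ud.normalize('NFC', text).encode('latin-1', errors='replace')

# Both Unicode spellings of the name now encode identically:
assert to_latin1('Gonza\u0301lez, M.') == b'Gonz\xe1lez, M.'  # decomposed
assert to_latin1('Gonz\xe1lez, M.') == b'Gonz\xe1lez, M.'     # precomposed
```

Dropping this helper into the fileinput loop from the question (writing to_latin1(line) with sys.stdout.buffer.write) should produce the same ISO-8859-1 output that uconv gives.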