Tags: python, utf-8, iso-8859-1

Can you safely read utf8 and latin1 files with a naïve try-except block?


I believe that any valid latin1 character will either be interpreted correctly by Python's utf8 decoder or raise an error. I therefore claim that if you only ever work with utf8 files or latin1 files, you can safely write the following code to read those files without ending up with Mojibake:

from pathlib import Path

def read_utf8_or_latin1_text(path: Path) -> str:
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        return path.read_text(encoding="latin1")

I tested this hypothesis out on this large character data set and found that it holds up to scrutiny. Is this always the case?

Input:

import requests

insanely_many_characters = requests.get(
    "https://github.com/bits/UTF-8-Unicode-Test-Documents/raw/master/UTF-8_sequence_unseparated/utf8_sequence_0-0x10ffff_including-unassigned_including-unprintable-asis_unseparated.txt"
).text


print(
    f"\n=== test {len(insanely_many_characters)} utf-8 characters for same-same misinterpretations ==="
)
for char in insanely_many_characters:
    if (x := char.encode("utf-8").decode("utf-8")) != char:
        print(char, x)


print(
    f"\n=== test {len(insanely_many_characters)} latin1 characters for same-same misinterpretations ==="
)
latinable = []  # characters that can round-trip through latin1
nr = 0  # count of characters that cannot be encoded as latin1
for char in insanely_many_characters:
    try:
        if (x := char.encode("latin1").decode("latin1")) != char:
            print(char, x)
        latinable.append(char)
    except UnicodeEncodeError:
        nr += 1
if nr:
    print(f"{nr} characters not in latin1 set")

print('found the following valid latin1 characters: """\n' + "".join(latinable) + '\n"""')

print(
    f"\n=== test {len(latinable)} latin1 characters for utf-8 Mojibake ==="
)
for char in latinable:
    try:
        if (x := char.encode("latin1").decode("utf-8")) != char:
            print(char, x)
    except UnicodeDecodeError:
        pass

Output:


=== test 1111998 utf-8 characters for same-same misinterpretations ===

=== test 1111998 latin1 characters for same-same misinterpretations ===
1111742 characters not in latin1 set
found the following valid latin1 characters: """
    
 !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~€‚ƒ„…†‡ˆ‰Š‹ŒŽ‘’“”•–—˜™š›œžŸ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ
"""

=== test 256 latin1 characters for utf-8 Mojibake ===

Addendum:

I see I totally forgot to test for sequences of latin1 characters, and only tested for individual characters. By adding this test:

print(
    f"\n=== test {len(latinable)} latin1 sequences wrongly interoperable by utf-8 ==="
)
for char1 in latinable:
    for char2 in latinable:
        try:
            if (x := (char1 + char2).encode("latin1").decode("utf-8")) != char1 + char2:
                print(char1 + char2, x)
        except UnicodeDecodeError:
            pass

I ended up generating a lot of utf-8 Mojibake (1920 instances in total), which is a counterexample to my hypothesis:

=== test 256 latin1 sequences wrongly interpretable as utf-8 ===
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
   
¡ ¡

 ⋮
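
One readable instance of such a pair (a minimal sketch, not copied from the dump above):

print("Â¡".encode("latin1").decode("utf-8"))  # prints "¡": the latin1 bytes 0xC2 0xA1 also happen to be valid UTF-8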

Solution

  • You are incorrect. The latin-1 encoding is a 1-to-1 mapping for every byte value, so any byte sequence decodes successfully as latin-1. It is entirely possible for validly encoded latin-1 data to, by coincidence, contain byte sequences that also decode as valid UTF-8, just to different characters. The odds of such a collision decrease the more non-ASCII characters are present in the data, but it can occur.

    Your test isn't finding these cases because, by definition, it takes at least two characters encoded as latin-1 to produce a valid, but mismatched, UTF-8 encoding of a single character (a non-ASCII character encoded as UTF-8 is always 2-4 bytes long, never just one byte, while ASCII encodes identically in both latin-1 and UTF-8). Since you only tested encoding a single character as latin-1, it's impossible for it to produce a legal UTF-8 representation, but many pairs (or triples, or quads) of latin-1 characters above the ASCII range will coincidentally produce legal UTF-8 bytes. They may make for total garbage, but they'll decode validly, as the sketch below illustrates.
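
    A minimal sketch of how this bites the read helper from the question (the file name is hypothetical, and the contents "Ã©" are chosen deliberately to trigger the collision):

    from pathlib import Path

    def read_utf8_or_latin1_text(path: Path) -> str:
        try:
            return path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            return path.read_text(encoding="latin1")

    # The latin1 text "Ã©" is stored as the bytes 0xC3 0xA9, which is also the
    # valid UTF-8 encoding of "é", so no UnicodeDecodeError is raised.
    p = Path("example_latin1.txt")
    p.write_text("Ã©", encoding="latin1")
    print(read_utf8_or_latin1_text(p))  # prints "é", not "Ã©": silent Mojibake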