[SOLVED] Fix encoding errors in csv with mixed encodings

There are some other questions here about this problem, but none of them has fixed it for me so far.

I have a large (40 MB) CSV file. Most of the file is encoded in ISO-8859-1 (Latin-1), but some entries (just individual entries!) are in UTF-8.

If I try to open the file with UTF-8 encoding, Python throws a decoding error. If I open the file as ISO-8859-1, it can be read, but some entries remain mojibake.

I tried the following code to fix the issues line by line, but I am obviously missing something, because the UTF-8 entries remain mojibake.


import os
import ftfy

# There are issues with file encoding with this file. Try to convert it to proper utf-8
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
sourceFileName = os.path.join(folder, file + config.CSV_EXTENSION)
targetFileName = os.path.join(folder, file + "_temp" + config.CSV_EXTENSION)

# input file
csvfile = open(sourceFileName, "r", encoding = sourceEncoding)
Lines = csvfile.readlines()
for line in Lines:
    line = ftfy.fix_encoding(line) 

# output stream
outfile = open(targetFileName, "w", encoding = targetEncoding) # Windows doesn't like utf8
outfile.writelines(Lines)

# close files
csvfile.close()
outfile.close()

os.remove(sourceFileName)
os.rename(targetFileName, sourceFileName)
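
One detail in the snippet above is worth spelling out: the loop assigns the result of `ftfy.fix_encoding` to the loop variable `line`, which does not change the entries of `Lines`, so the unmodified lines are what get written out. A minimal sketch of the difference, using a hypothetical stand-in `fix` function (it is not ftfy, and only handles this one example) so that it runs without extra dependencies:

```python
def fix(text):
    # Hypothetical stand-in for a repair function such as ftfy.fix_encoding.
    # It only handles the single mojibake sequence used in this example.
    return text.replace("\u00c3\u009c", "Ü")


lines = ["\u00c3\u009cberarbeitung A6 Heft\n", "Testdurchführung\n"]

# Rebinding the loop variable leaves the list untouched.
for line in lines:
    line = fix(line)
unchanged = list(lines)

# Building a new list keeps the repaired strings.
fixed = [fix(line) for line in lines]
```

Writing `fixed` instead of the original list would be the corresponding change in the snippet above. Whether ftfy repairs every affected entry in the real file is a separate question that this sketch does not answer.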

One specific string I have looks like this:

Ãberarbeitung A6 Heft

I want it to look like this:

Überarbeitung A6 Heft
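
For illustration, this kind of mojibake can be reproduced by encoding the intended text as UTF-8 and decoding the bytes as ISO-8859-1. The sketch below assumes that this is what happened to the entry; it also shows that the round trip is reversible as long as the bytes are untouched.

```python
original = "Überarbeitung A6 Heft"

# "Ü" becomes the two bytes 0xC3 0x9C in UTF-8.
utf8_bytes = original.encode("utf-8")

# Read as ISO-8859-1, each byte turns into its own character:
# 0xC3 is "Ã" and 0x9C is an invisible control character.
mojibake = utf8_bytes.decode("iso-8859-1")

# Reversing the wrong decode recovers the intended text.
repaired = mojibake.encode("iso-8859-1").decode("utf-8")
```

The invisible control character is why the broken string appears to start with just "Ã" followed directly by "berarbeitung".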

Edit:

Some clarifications.

I assume there are encoding issues in the file because I know there are two different sources for the entries in the CSV. Most of them are typed into a GUI. Some values come from a self-written script with god knows what encoding.

If I open the CSV in VS Code, it assumes ISO-8859-1. But then some entries look like the one I mentioned above:

 Ãberarbeitung A6 Heft

If I change the encoding to UTF-8, this entry becomes 'right':

Überarbeitung A6 Heft

But then other entries change for the worse:

Testdurchf�hrung

The error message when trying to open the file with UTF-8 encoding is:

Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0xe4 in position 56: invalid continuation byte
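
The same kind of error can be reproduced in isolation. In the sketch below, the word "Erklärung" is an assumed example (it is not taken from the file); what matters is that "ä" is the single byte 0xE4 in ISO-8859-1, which UTF-8 treats as the start of a multi-byte sequence that the next byte does not continue.

```python
# "ä" is the single byte 0xE4 in ISO-8859-1.
latin1_bytes = "Erklärung".encode("iso-8859-1")

try:
    latin1_bytes.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # Keep a reference, because "exc" is cleared when the except block ends.
    error = exc

# ISO-8859-1 maps every possible byte to a character, so this never fails.
decoded = latin1_bytes.decode("iso-8859-1")
```

Because decoding as ISO-8859-1 cannot fail, it gives no signal about whether it was the right choice, while a failed UTF-8 decode is a fairly reliable sign that the bytes are not UTF-8.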

I will try to read the CSV in binary mode and decode it line by line. Maybe this will do the trick.

Solution

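You can open the file in binary mode and decode each line manually, trying UTF-8 first and falling back to Latin-1:

def try_decode(b, encodings):
    # Try each encoding in order and return the first successful decode.
    for enc in encodings:
        try:
            return b.decode(enc)
        except UnicodeDecodeError:
            pass
    raise ValueError('no matching encoding!')

# YOUR_FILE is a placeholder for the path to the CSV file.
with open(YOUR_FILE, 'rb') as fp:
    for b in fp:
        # Order matters: latin1 accepts any byte sequence, so it must come last.
        line = try_decode(b, ['utf8', 'latin1'])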
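
To get a cleaned file out of this, the decoded lines still have to be written somewhere. A minimal sketch of one way to do that, assuming the goal is a UTF-8 copy of the file; the names `decode_line` and `convert_to_utf8` are hypothetical and not part of the answer above:

```python
def decode_line(raw, encodings=("utf-8", "iso-8859-1")):
    """Decode raw bytes with the first encoding that accepts them."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("no matching encoding")


def convert_to_utf8(source, target):
    """Copy source to target line by line, writing everything as UTF-8."""
    # newline="" keeps the original line endings instead of translating them.
    with open(source, "rb") as src, \
            open(target, "w", encoding="utf-8", newline="") as dst:
        for raw in src:
            dst.write(decode_line(raw))
```

One limitation of the line-by-line approach: a line that contains both a UTF-8 entry and a Latin-1 entry fails the UTF-8 attempt as a whole and is decoded as Latin-1, so its UTF-8 entry stays mojibake. Decoding each field separately would be needed to handle such lines.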