
Fix encoding errors in csv with mixed encodings


There are some other questions here regarding this problem, but none of them has solved my issue so far.

I have a large (40 MB) CSV file. Most of the file is encoded in ISO-8859-1 (Latin-1), but some entries (just entries!) are in UTF-8.

If I try to open the file with UTF-8 encoding, Python throws a decoding error. If I open the file as ISO-8859-1, it can be read, but some entries remain mojibake.

I tried the following code to fix the issues line by line, but I am obviously missing something, because the UTF-8 entries remain mojibake.


import os
import ftfy

# There are issues with file encoding with this file. Try to convert it to proper utf-8
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
sourceFileName = os.path.join(folder, file + config.CSV_EXTENSION)
targetFileName = os.path.join(folder, file + "_temp" + config.CSV_EXTENSION)

# input file
csvfile = open(sourceFileName, "r", encoding = sourceEncoding)
Lines = csvfile.readlines()
for line in Lines:
    line = ftfy.fix_encoding(line) 

# output stream
outfile = open(targetFileName, "w", encoding = targetEncoding) # Windows doesn't like utf8
outfile.writelines(Lines)

# close files
csvfile.close()
outfile.close()

os.remove(sourceFileName)
os.rename(targetFileName, sourceFileName)

One specific string I have looks like this:

Ãberarbeitung A6 Heft

I want it to look like this:

Überarbeitung A6 Heft

Edit:

Some clarifications.

I assume there are encoding issues in the file because I know there are two different sources for entries in the CSV. Most of them are typed into a GUI. Some values come from a self-written script with god knows what encoding.

If I open the CSV in VS Code, it assumes the file is ISO-8859-1. But then some entries look like the one I mentioned above:

Ãberarbeitung A6 Heft

If I change the encoding to UTF-8, this entry is displayed correctly:

Überarbeitung A6 Heft

But then other entries change for the worse:

Testdurchf�hrung

The error message when trying to open the file with utf-8 encoding is:

Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0xe4 in position 56: invalid continuation byte
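
Both symptoms can be reproduced with the standard library alone. This is only a minimal sketch: the word "Erklärung" is a made-up sample, chosen because it contains "ä" (byte 0xe4 in Latin-1), and is not taken from the actual file.

```python
# UTF-8 bytes decoded as Latin-1: this never fails, but it produces mojibake.
# "Ü" is stored as the two bytes C3 9C, which Latin-1 maps to "Ã" plus the
# invisible control character U+009C.
utf8_bytes = "Überarbeitung A6 Heft".encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")

# Undoing the wrong decode recovers the original text.
repaired = mojibake.encode("latin-1").decode("utf-8")

# Latin-1 bytes decoded as UTF-8: 0xe4 looks like the start of a multi-byte
# sequence, but the next byte ("r") is not a continuation byte.
latin1_bytes = "Erklärung".encode("latin-1")  # hypothetical sample word
error_reason = None
error_start = None
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    error_reason = exc.reason
    error_start = exc.start
```

Because every byte value is valid in Latin-1, opening the file as ISO-8859-1 always succeeds, which is why the wrongly decoded UTF-8 entries go unnoticed until they are displayed.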

I will try to read the CSV in binary mode and decode it line by line. Maybe this will do the trick.


Solution

  • You can read the file in binary mode and decode each line manually, trying UTF-8 first and falling back to Latin-1:

    def try_decode(b, encodings):
        for enc in encodings:
            try:
                return b.decode(enc)
            except UnicodeDecodeError:
                pass
        raise ValueError('no matching encoding!')
    
    with open(YOUR_FILE, 'rb') as fp:
        for b in fp:
            line = try_decode(b, ['utf8', 'latin1'])
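
Building on the snippet above, a complete conversion to a UTF-8 file might look like the following. This is a sketch under assumptions, not a tested drop-in: the function name `convert_to_utf8` and its parameters are made up for illustration, and it assumes each line is entirely in one encoding.

```python
def try_decode(raw, encodings=("utf-8", "latin-1")):
    """Return raw decoded with the first encoding that accepts it."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("no matching encoding!")


def convert_to_utf8(source_path, target_path):
    """Hypothetical helper: rewrite source_path as UTF-8 into target_path."""
    # newline="" keeps the original line endings instead of translating them.
    with open(source_path, "rb") as src, \
            open(target_path, "w", encoding="utf-8", newline="") as dst:
        for raw in src:  # iterating a binary file yields one line of bytes at a time
            dst.write(try_decode(raw))
```

The order of the encodings matters: Latin-1 accepts any byte sequence, so it has to come last, and with it in the list the `ValueError` is never actually reached. A line that mixes a UTF-8 entry with a Latin-1 entry fails the UTF-8 attempt and is decoded as Latin-1 as a whole, so its UTF-8 entry would still come out as mojibake; in that case the same fallback would have to be applied per field rather than per line.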