There are some other questions here regarding this problem, but none of them has fixed my issue so far.
I have a large (40 MB) CSV file. Most of the file is encoded in ISO-8859-1 (Latin-1), but some entries (just single entries!) are encoded in UTF-8.
If I try to open the file with UTF-8 encoding, Python immediately throws an encoding error at me. If I open the file as ISO-8859-1, it can be read, but some entries remain mojibake.
I tried the following code to fix the issues line by line, but obviously I am missing something, because the UTF-8 entries remain mojibake.
import os
import ftfy
# There are issues with the encoding of this file. Try to convert it to proper UTF-8.
sourceEncoding = "iso-8859-1"
targetEncoding = "utf-8"
sourceFileName = os.path.join(folder, file + config.CSV_EXTENSION)
targetFileName = os.path.join(folder, file + "_temp" + config.CSV_EXTENSION)
# input file
csvfile = open(sourceFileName, "r", encoding = sourceEncoding)
Lines = csvfile.readlines()
for line in Lines:
    line = ftfy.fix_encoding(line)
# output stream
outfile = open(targetFileName, "w", encoding = targetEncoding) # Windows doesn't like utf8
outfile.writelines(Lines)
# close files
csvfile.close()
outfile.close()
os.remove(sourceFileName)
os.rename(targetFileName, sourceFileName)
One specific string I have looks like this:
Ãberarbeitung A6 Heft
I want it to look like this:
Überarbeitung A6 Heft
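As far as I can tell, that is exactly the pattern you get when UTF-8 bytes are decoded as Latin-1; the second byte of 'Ü' (0x9C) maps to an invisible control character, which is why only the 'Ã' shows up. A quick check in the Python shell:
>>> "Überarbeitung A6 Heft".encode("utf-8").decode("iso-8859-1")
'Ã\x9cberarbeitung A6 Heft'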
Edit:
Some clarifications.
I assume there are some encoding issues in the file, because I know there are two different sources for the entries in the CSV. Most of them were entered by typing the value into a GUI. Some values come from a self-written script with god knows what encoding.
If I open the CSV in VSCode, it assumes ISO-8859-1. But then some entries look like I mentioned above:
Ãberarbeitung A6 Heft
If I change the encoding to UTF-8, this entry becomes 'right':
Überarbeitung A6 Heft
But then other entries change for the worse:
Testdurchf�hrung
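If I understand correctly, that is the opposite direction: a raw Latin-1 byte (0xFC for 'ü') forced through a UTF-8 decoder, which the editor then shows as the replacement character:
>>> "Testdurchführung".encode("iso-8859-1").decode("utf-8", errors="replace")
'Testdurchf�hrung'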
The error message when trying to open the file with UTF-8 encoding is:
Exception has occurred: UnicodeDecodeError
'utf-8' codec can't decode byte 0xe4 in position 56: invalid continuation byte
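which, if I read it correctly, also points at raw Latin-1 bytes in the file, since 0xE4 is 'ä' in ISO-8859-1:
>>> b"\xe4".decode("iso-8859-1")
'ä'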
I will try to read the CSV in binary mode and decode it line by line. Maybe this will do the trick.
You can read the file in binary mode and decode each line manually:
def try_decode(b, encodings):
    # Try the encodings in order and return the first successful decode.
    for enc in encodings:
        try:
            return b.decode(enc)
        except UnicodeDecodeError:
            pass
    raise ValueError('no matching encoding!')

with open(YOUR_FILE, 'rb') as fp:
    for b in fp:
        # Lines that are valid UTF-8 decode as UTF-8; everything else
        # falls back to Latin-1, which accepts any byte sequence.
        line = try_decode(b, ['utf8', 'latin1'])
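Building on that, a minimal sketch of how you could write a repaired UTF-8 copy of the file (the file names here are just placeholders):
def convert_to_utf8(source, target):
    # newline='' keeps the decoded line endings as they are instead of translating them again
    with open(source, 'rb') as fp, open(target, 'w', encoding='utf8', newline='') as out:
        for b in fp:
            out.write(try_decode(b, ['utf8', 'latin1']))

convert_to_utf8('input.csv', 'input_utf8.csv')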