I am trying to append a column of a CSV file to a JSONL file using Python. I tried the following code:
import csv
import json

def csv_column_to_jsonl(csv_file, column_index, jsonl_file):
    with open(csv_file, 'r', encoding='utf-8-sig') as file:
        reader = csv.reader(file)
        data = [row[column_index] for row in reader]
    with open(jsonl_file, 'a', encoding='ascii') as file:
        for item in data:
            json.dump({"text": item}, file)
            file.write('\n')

csv_file = 'mydataset.csv'
column_index = 2
jsonl_file = 'jsondata.jsonl'
csv_column_to_jsonl(csv_file, column_index, jsonl_file)
The error I am getting:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdb in position 0: unexpected end of data
Using chardet, the detected encodings are as follows:
jsondata.jsonl {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
mydataset.csv {'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
The only combination that works and appends the CSV to the JSONL file is when I use encoding='unicode_escape' for the CSV file, but in that case the resulting JSONL displays the data as follows:
Up to index 4 is the previous data, while from index 120315 onward the appended data is shown, and it does not match. Please let me know what to amend in my code.
I'm glad you figured out the problem. I didn't even think the JSON file could be the issue, and completely missed chardet's assessment of it being ASCII.
In the future, when you run into problems like this (and I should have asked for this first), it helps to know exactly which file has the problem. When you ran the code and got the error, you probably saw a stack trace pointing to a line that would have indicated the JSON file, not the CSV file, which is what I interpreted your question to be primarily about.
If the stack trace doesn't make sense, and you think or know the problem is in one of several files, add some code to pinpoint exactly which file fails. Since you were already getting a UnicodeDecodeError, you could catch just that error, like:
try:
    with open(csv_file, 'r', encoding='utf-8-sig') as file:
        reader = csv.reader(file)
        data = [row[column_index] for row in reader]
except UnicodeDecodeError as e:
    print(f"couldn't decode {csv_file} with 'utf-8-sig': {e}")
    ...
and repeat for the JSON file. If you happen to encounter some error other than UnicodeDecodeError, you'll see another stack-trace-error combination, and you can focus on that... yay, debugging!
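One way to avoid repeating that try/except for every file is a small helper, as a sketch (check_decodable is a name I've made up for illustration):

```python
def check_decodable(path, encoding):
    """Return True if the file decodes cleanly with the given encoding."""
    try:
        with open(path, 'r', encoding=encoding) as f:
            f.read()
        return True
    except UnicodeDecodeError as e:
        print(f"couldn't decode {path} with {encoding!r}: {e}")
        return False
```

Call it once per file, e.g. check_decodable(csv_file, 'utf-8-sig') and check_decodable(jsonl_file, 'utf-8'), and the failing file names itself.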
I believe if you do that you will get the very clear message:
couldn't decode jsondata.jsonl with 'ascii': UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdb in position 0: unexpected end of data
Good luck!
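For completeness, here is one sketch of how the append step could sidestep the mismatch entirely, by writing the JSONL in UTF-8 rather than ASCII (an assumption on my part; adjust if you genuinely need an ASCII-only file — ensure_ascii=False keeps non-ASCII text readable in the output):

```python
import csv
import json

def csv_column_to_jsonl(csv_file, column_index, jsonl_file):
    # Read the chosen column from the CSV (UTF-8 with an optional BOM).
    with open(csv_file, 'r', encoding='utf-8-sig', newline='') as f:
        data = [row[column_index] for row in csv.reader(f)]
    # Append as JSON Lines in UTF-8, which can represent any character.
    with open(jsonl_file, 'a', encoding='utf-8') as f:
        for item in data:
            f.write(json.dumps({"text": item}, ensure_ascii=False) + '\n')
```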