Tags: python, json, csv, google-colaboratory, codec

Python error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdb in position 0: unexpected end of data


I am trying to append a column of a CSV file to a JSONL file using Python. I tried the following code:

    import csv
    import json

    def csv_column_to_jsonl(csv_file, column_index, jsonl_file):
        with open(csv_file, 'r', encoding='utf-8-sig') as file:
            reader = csv.reader(file)
            data = [row[column_index] for row in reader]

        with open(jsonl_file, 'a', encoding='ascii') as file:
            for item in data:
                json.dump({"text": item}, file)
                file.write('\n')


    csv_file = 'mydataset.csv'
    column_index = 2
    jsonl_file = 'jsondata.jsonl'
    csv_column_to_jsonl(csv_file, column_index, jsonl_file)

The error I am getting:

    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdb in position 0: unexpected end of data

Using chardet, the detected encodings are as follows:

    jsondata.jsonl   {'encoding': 'ascii', 'confidence': 1.0, 'language': ''}
    mydataset.csv    {'encoding': 'UTF-8-SIG', 'confidence': 1.0, 'language': ''}
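As a stdlib-only alternative to chardet (which guesses statistically and is a third-party package), you can check which codecs can actually decode a file by trying each one in turn. The helper below, `guess_encoding`, is a hypothetical name and the candidate list is an assumption; `cp1256` is included only because byte `0xdb` is valid in common Windows single-byte code pages, and `latin-1` decodes any byte sequence, so it acts as a catch-all last resort.

```python
def guess_encoding(path, candidates=("ascii", "utf-8-sig", "utf-8", "cp1256", "latin-1")):
    """Return the first candidate codec that decodes the whole file, or None."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            raw.decode(enc)  # raises UnicodeDecodeError if this codec can't decode it
            return enc
        except UnicodeDecodeError:
            continue
    return None
```

Note that "decodes without error" is weaker than "is the right encoding" — `latin-1` will accept anything — so treat the result as a hint, not a verdict.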

The only combination that works and appends the CSV to the JSONL file is when I use encoding='unicode_escape' for the CSV file, but in that case the resulting JSONL displays data like the following:

[screenshot: the resulting jsonl file, with the appended rows not matching the earlier data]

Up to index 4 is the previous data, while from index 120315 onward the appended data is shown, and it does not match. Please let me know what to amend in my code.


Solution

  • I'm glad you figured out the problem. I didn't even think the JSON file could be the issue, and completely missed chardet's assessment of it being ASCII.

    In the future, when you run into problems like this (and I should have asked for this first), it helps to know exactly which file has the problem. When you ran that code and got the error, the stack trace probably pointed to a line indicating the JSON file, not the CSV file, which is what I interpreted your question to be primarily about.

    If the stack trace doesn't make sense and you think (or know) the problem is in one of many files, add some code to point out exactly which file is failing. Since you were already getting a UnicodeDecodeError, you could catch just that error, like:

    try:
        with open(csv_file, 'r', encoding='utf-8-sig') as file:
            reader = csv.reader(file)
            data = [row[column_index] for row in reader]
    except UnicodeDecodeError as e:
        print(f"couldn't decode {csv_file} with 'utf-8-sig': {e}")
    
    ...
    

    and repeat for the JSON file. If you happen to encounter some error other than UnicodeDecodeError, you'll see another stack-trace-error combination, and you can focus on that... yay, debugging!

    I believe if you do that you will get the very clear message:

    couldn't decode jsondata.jsonl with 'ascii': UnicodeDecodeError: 'utf-8' codec can't decode byte 0xdb in position 0: unexpected end of data
    

    Good luck!
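For completeness, here is one way the function could be fixed, assuming the root cause was the `encoding='ascii'` on the output file: open the JSONL in UTF-8 instead, and (optionally) pass `ensure_ascii=False` so non-ASCII text is written as readable characters rather than `\uXXXX` escapes. This is a sketch, not necessarily the asker's final fix; whatever later reads `jsondata.jsonl` must then also open it with `encoding='utf-8'`.

```python
import csv
import json

def csv_column_to_jsonl(csv_file, column_index, jsonl_file):
    # utf-8-sig strips a leading BOM if present; newline='' is the
    # recommended setting for the csv module.
    with open(csv_file, 'r', encoding='utf-8-sig', newline='') as f:
        data = [row[column_index] for row in csv.reader(f)]

    # Append as UTF-8 so any character in the CSV can be written.
    with open(jsonl_file, 'a', encoding='utf-8') as f:
        for item in data:
            json.dump({"text": item}, f, ensure_ascii=False)
            f.write('\n')
```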