python, pandas, dask, dask-dataframe, jsonlines

Why does Pandas "utf-8-sig" encoding work but Dask doesn't?


file.json

[{"id":1, "name":"Tim"},
{"id":2, "name":"Jim"},
{"id":3, "name":"Paul"},
{"id":4, "name":"Sam"}]

The file is encoded as "UTF-8 with BOM".

When I use pandas, it works:

df = pd.read_json('file.json',
    encoding='utf-8-sig',
    orient='records')

This succeeds.

When I use dask, it fails:

df = dd.read_json('file.json',
    encoding='utf-8-sig',
    orient='records')

ValueError: An error occurred while calling the read_json method registered to the pandas backend. Original Message: Expected object or value

I am trying to read the data into a dask DataFrame. The original message leads me to believe it's a parse issue, but could this be a bug? Does dask not support the same encoding options as pandas?


Solution

  • By default, dask.dataframe.read_json expects the raw data to be line-delimited JSON. This can be changed by passing lines=False as a keyword argument. Here's a minimal reproducible example:

    data = [
        {"id": 1, "name": "Tim"},
        {"id": 2, "name": "Jim"},
        {"id": 3, "name": "Paul"},
        {"id": 4, "name": "Sam"},
    ]
    
    from json import dumps
    
    with open("file.json", "w", encoding="utf-8-sig") as f:
        f.write(dumps(data))
    
    from dask.dataframe import read_json
    
    df = read_json("file.json", encoding="utf-8-sig", lines=False)
    print(df.compute())
    #    id  name
    # 0   1   Tim
    # 1   2   Jim
    # 2   3  Paul
    # 3   4   Sam
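
    Alternatively, if you control the file format, you can keep dask's line-delimited default by writing the same records as JSON Lines (one object per line). This is a sketch under that assumption; the filename "file.jsonl" is chosen here for illustration:

    ```python
    import json
    from dask.dataframe import read_json

    data = [
        {"id": 1, "name": "Tim"},
        {"id": 2, "name": "Jim"},
        {"id": 3, "name": "Paul"},
        {"id": 4, "name": "Sam"},
    ]

    # Write one JSON object per line (JSON Lines). The BOM appears only
    # at the start of the file, so utf-8-sig decoding strips it cleanly.
    with open("file.jsonl", "w", encoding="utf-8-sig") as f:
        for record in data:
            f.write(json.dumps(record) + "\n")

    # With line-delimited input, no lines=False override is needed.
    df = read_json("file.jsonl", encoding="utf-8-sig", lines=True)
    print(df.compute())
    ```

    The line-delimited layout is also what lets dask split a large file into multiple partitions, since each line is independently parseable.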