python, csv, unicode, utf-8, dask

Unicode error when using dask.dataframe.read_csv


I am running into the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xac in position 0: invalid start byte, logged as 2023-09-19 13:04:11,361 - distributed.core - ERROR - Exception while handling op register-client, when using

import dask.dataframe as dd

fstringval = 3
# dtypes is a dict of column dtypes defined earlier in the script (not shown here)
ddf = dd.read_csv(f"C:\\myfile\\witth\\fstring\\data{fstringval}.txt", encoding="utf8", sep="|",
                  header=None, dtype=dtypes, assume_missing=True, encoding_errors="ignore")
ddf.compute()

I have tried changing the encoding, but when I open the file in Notepad it reports the encoding as UTF-8, so I was not expecting any improvement and indeed got none. I have also tried different values for the encoding_errors parameter, and all of them result in the same error.
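As a sanity check (just a sketch; the path below is the same placeholder path from the snippet above, with fstringval already substituted), you can read the start of the file in binary mode and try to decode it yourself. This tells you whether the offending byte really comes from the file or from somewhere else, such as the client/scheduler communication hinted at by the distributed.core line in the error:

import pathlib

# sample the first 4 KB of the file and attempt a UTF-8 decode
sample = pathlib.Path("C:\\myfile\\witth\\fstring\\data3.txt").read_bytes()[:4096]
try:
    sample.decode("utf-8")
    print("the first 4 KB decode cleanly as UTF-8")
except UnicodeDecodeError as exc:
    print("the file itself is not valid UTF-8:", exc)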


Solution

  • I found this answer on GitHub, and the fix was as simple as updating msgpack-python from version 1.0.3 to version 1.0.5 using

    conda install -c conda-forge msgpack-python==1.0.5

    I do not understand why this happened, but it solved the problem; please refer to the linked issue for a more specific explanation.
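    Since the error was raised from distributed.core while handling op register-client (the client/scheduler communication layer) rather than from reading the file, it can help to confirm which msgpack release is actually being imported. A minimal check, assuming the msgpack module exposes its version tuple as recent releases do:

    import msgpack

    # should print (1, 0, 5) or later after the upgrade
    print(msgpack.version)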