Tags: python, dask, dask-delayed

Splitting very large CSV files into smaller files


Is Dask suitable for reading large CSV files in parallel and splitting them into multiple smaller files?


Solution

  • Yes, Dask can read large CSV files. It splits them into chunks (partitions) that are read in parallel:

    import dask.dataframe as dd

    df = dd.read_csv("/path/to/myfile.csv")


    Then, when saving, Dask writes the CSV data to multiple files, one per partition; the "*" in the path is replaced by the partition number:

    df.to_csv("/output/path/*.csv")


    See the read_csv and to_csv docstrings for much more information about this, and the end-to-end sketch below.
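
    For reference, here is a minimal end-to-end sketch combining the two steps.
    The paths, the 64MB blocksize, and the "part-*.csv" naming pattern are
    placeholder choices for illustration, not requirements:

    import dask.dataframe as dd

    # blocksize controls how the input is partitioned: one task
    # (and later one output file) per ~64MB chunk of the input.
    df = dd.read_csv("/path/to/myfile.csv", blocksize="64MB")

    # Each partition is written to its own file; the "*" in the
    # pattern is replaced by the partition number (0, 1, 2, ...).
    df.to_csv("/output/path/part-*.csv", index=False)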