python, pandas, memory, io, chunks

What is the optimal chunksize in pandas read_csv to maximize speed?


I have a 20 GB (compressed) .csv file, and I load a couple of columns from it with pandas pd.read_csv(), using a chunksize=10000 parameter.

However, this value is completely arbitrary, and I wonder whether a simple formula could give me a better chunksize that would speed up loading the data.

Any ideas?


Solution

  • There is no "optimal chunksize" [*]. chunksize only tells you the number of rows per chunk, not the memory size of a single row, so it's meaningless to try to make a rule of thumb based on it. ([*] Although generally I've only ever seen chunksizes in the range 100..64K.)

    To get memory size, you'd have to convert that to a memory size per chunk or per row, by looking at your number of columns, their dtypes, and the size of each. Use df.info(memory_usage='deep') for an overview, or, for a more in-depth breakdown of memory usage by column:

    # Approximate per-row memory usage (in bytes) for each column
    print('df memory usage by column...')
    print(df.memory_usage(index=False, deep=True) / df.shape[0])
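
    Putting that together, here is a minimal sketch of how you might derive a row count from a memory budget you choose. The file name, column list, sample size, and 500 MB budget below are all assumptions for illustration, not part of any pandas recommendation:

        import pandas as pd

        # Hypothetical example: sample a few rows, measure bytes per row,
        # then size chunks to fit a chosen memory budget.
        sample = pd.read_csv('big.csv', usecols=['col_a', 'col_b'], nrows=10000)
        bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

        target_bytes = 500 * 1024 ** 2                 # e.g. a ~500 MB budget per chunk
        chunksize = max(1, int(target_bytes // bytes_per_row))

        for chunk in pd.read_csv('big.csv', usecols=['col_a', 'col_b'],
                                 chunksize=chunksize):
            pass  # process each chunk here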