python, pandas, memory, io, chunks

What is the optimal chunksize in pandas read_csv to maximize speed?


I have a 20 GB (compressed) .csv file, and I load a couple of columns from it with pandas pd.read_csv(), using a chunksize=10000 parameter.

However, this value is completely arbitrary, and I wonder whether a simple formula could give me a better chunksize that would speed up loading the data.

Any ideas?


Solution

  • There is no "optimal chunksize" [*]. chunksize only tells you the number of rows per chunk, not the memory size of a single row, so it's meaningless to try to make a rule of thumb based on it. ([*] Although generally I've only ever seen chunksizes in the range 100..64K.)

    To get memory size, you'd have to convert that to a memory size per chunk or per row, by looking at your number of columns, their dtypes, and the size of each. Use df.info(memory_usage='deep') for an overview, or, for a more in-depth breakdown of memory usage by column:

    # Approximate per-row memory usage (in bytes) for each column
    print('df memory usage by column...')
    print(df.memory_usage(index=False, deep=True) / df.shape[0])
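
    Putting that together, here is a minimal sketch of how you might derive a row count from a memory budget you choose. The file name, column list, sample size, and 500 MB budget below are all assumptions for illustration, not part of any pandas recommendation:

        import pandas as pd

        # Hypothetical example: sample a few rows, measure bytes per row,
        # then size chunks to fit a chosen memory budget.
        sample = pd.read_csv('big.csv', usecols=['col_a', 'col_b'], nrows=10000)
        bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / len(sample)

        target_bytes = 500 * 1024 ** 2                 # e.g. a ~500 MB budget per chunk
        chunksize = max(1, int(target_bytes // bytes_per_row))

        for chunk in pd.read_csv('big.csv', usecols=['col_a', 'col_b'],
                                 chunksize=chunksize):
            pass  # process each chunk here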