
Python pandas optimization for ubyte data (0..255)


How can I optimize a pandas DataFrame to use the ubyte data type (0..255)? (By default, integers are stored as int64.)

If I convert the data to the Categorical type, will the DataFrame use less memory?

Or is the only way to optimize it to use NumPy instead of pandas?


Solution

  • For unsigned integer data in the range 0..255, you can reduce storage from the default int64 (8 bytes per value) to uint8 (1 byte per value). You can refer to this article for an example where memory usage is reduced substantially, from 1.5MB to 332KB (around one fifth).

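    The Categorical type is not the optimal choice here. pandas stores a categorical column as an array of small integer codes plus a table of the distinct categories. That saves memory for repetitive string data, which pandas otherwise stores as object columns (an array of pointers to the memory address of each value). For values that already fit in 0..255, however, the codes take at least as much space as uint8 itself (with more than 127 distinct values the codes need int16), and the categories table comes on top. Refer to this article for more information.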

    To use uint8, you can either set it when you read your data, e.g. by specifying uint8 as the dtype of the input columns in the pd.read_csv call (see the first article for an example), or, if the data is already loaded, convert the columns with Series.astype() or DataFrame.astype(), using syntax like .astype('uint8'). A uint8 column is backed by a NumPy uint8 array, so you do not need to switch from pandas to NumPy to get this saving.
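
As a rough illustration of the points above, here is a minimal sketch that compares the memory used by the same values stored as int64, uint8 and category, and then reads a small CSV directly as uint8. The column names, the sample data and the data_bytes helper are made up for the example; exact figures for the categorical column may differ between pandas versions.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical sample data: one million integers in the range 0..255.
rng = np.random.default_rng(0)
values = rng.integers(0, 256, size=1_000_000, dtype=np.int64)
df = pd.DataFrame({"value": values})


def data_bytes(series):
    """Bytes used by the column data, ignoring the index (helper for this example)."""
    return series.memory_usage(index=False, deep=True)


# uint8 can only represent 0..255, so check the range before casting.
assert df["value"].between(0, 255).all()

as_int64 = df["value"]
as_uint8 = df["value"].astype("uint8")
as_category = df["value"].astype("category")

print("int64:   ", data_bytes(as_int64))     # 8 bytes per value
print("uint8:   ", data_bytes(as_uint8))     # 1 byte per value
print("category:", data_bytes(as_category))  # integer codes plus categories table

# Setting the dtype while reading, here from an in-memory CSV
# (column names "a" and "b" are placeholders).
csv_text = "a,b\n0,255\n17,128\n"
loaded = pd.read_csv(io.StringIO(csv_text), dtype={"a": "uint8", "b": "uint8"})
print(loaded.dtypes)

# Converting every column of an already loaded DataFrame in one call.
converted = df.astype("uint8")
```

In this sketch the uint8 column should come out at one eighth of the int64 size, while the categorical column ends up larger than the uint8 one because it stores both the codes and the table of categories.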