Tags: python, pandas, precision, sparse-matrix, dtype

How to choose pandas Sparse dtype and what are the memory implications?


I am trying to understand how to set up a sparse pandas DataFrame to minimize memory usage while retaining the precision of all values. I did not find the answers in the pandas Sparse documentation. Below is an example that illustrates my questions:

  1. Why does a Sparse(int32) dataframe take as much memory as a Sparse(float32) dataframe? Is there any advantage in specifying a Sparse(int) dtype if this is the case?

  2. How does pandas decide which specific Sparse(int) dtype to use, e.g. int8 or int32? In the example below (see dataframes sdf_int32 and sdf_high_int32), it appears Sparse(int32) is always chosen, even when Sparse(int8) would be more memory-efficient or when Sparse(int32) truncates some values.

  3. Is the only way to avoid truncation and achieve minimum memory usage to specify Sparse(intNN) or Sparse(floatNN) dtype for each column?

import numpy as np
import pandas as pd

# Generate binary dense matrix with low density
df = pd.DataFrame()
for col in ['col1', 'col2', 'col3']:
    df[col] = np.where(np.random.random_sample(100_000_000) > 0.98, 1, 0)
df.name = 'Dense'

# Replace one column by values too high for int32 dtype
df_high = df.copy()
df_high['col1'] = df_high['col1'] * 100_000_000_000

# Convert df to sparse of various dtypes
sdf_float32 = df.astype(pd.SparseDtype('float32', 0))
sdf_float32.name = 'Sparse, float32'
sdf_int8 = df.astype(pd.SparseDtype('int8', 0))
sdf_int8.name = 'Sparse, int8'
sdf_int32 = df.astype(pd.SparseDtype('int', 0))
sdf_int32.name = 'Sparse, int32'
sdf_int64 = df.astype(pd.SparseDtype('int64', 0))
sdf_int64.name = 'Sparse, int64'

# Convert df_high to Sparse(int)
sdf_high_int32 = df_high.astype(pd.SparseDtype('int', 0))
sdf_high_int32.dtypes
sdf_high_int32['col1'].value_counts()
sdf_high_int32.name = 'Sparse, int32 highval'

# Print info for all dataframes
print(f" {df.name} Dataframe; Memory size: {df.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB, {df['col1'].dtype}")
for data in [sdf_float32, sdf_int8, sdf_int32, sdf_high_int32, sdf_int64]:
    print(f" {data.name} Dataframe; Memory size: {data.memory_usage(deep=True).sum() / 1024**2:.1f} MB,"
          f"Density {data.sparse.density:.5%}, {data['col1'].dtype}")

"""
Dense Dataframe; Memory size: 1144.4 MB, int32
 Sparse, float32 Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[float32, 0]
 Sparse, int8 Dataframe; Memory size: 28.6 MB,Density 1.99980%, Sparse[int8, 0]
 Sparse, int32 Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[int32, 0]
 Sparse, int32 highval Dataframe; Memory size: 45.8 MB,Density 1.99980%, Sparse[int32, 0]
 Sparse, int64 Dataframe; Memory size: 68.7 MB,Density 1.99980%, Sparse[int64, 0]
"""

# Show truncated values for sdf_high_int32
print(f"Values for sdf_high_int32, col1: \n {sdf_high_int32['col1'].value_counts()}")

"""
Values for sdf_high_int32, col1: 
 col1
0             98001473
1215752192     1998527
Name: count, dtype: int64
"""


Solution

  • There are really two parts to your question; the first is about sparse matrices. Here is the relevant passage from the pandas documentation:

    pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical “mostly 0”. Rather, you can view these objects as being “compressed” where any data matching a specific value (NaN / missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually stored in the array.

    That means that only the value you chose not to store (0 in your case) is omitted; all other values are stored at the dtype you picked. float32 and int32 both use 32 bits per value, so they consume the same amount of memory. The difference is which values they can represent and at what precision. The same holds for int64 versus float64.
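    You can verify this with a minimal sketch (nothing new is assumed beyond the constructors you already use): per stored value, int32 and float32 both cost 4 bytes, so two sparse columns with identical non-fill positions report identical memory usage.

    import numpy as np
    import pandas as pd

    # Both dtypes use 4 bytes per stored (non-fill) value
    print(np.dtype('int32').itemsize, np.dtype('float32').itemsize)  # 4 4

    s = pd.Series(np.where(np.random.random_sample(1_000_000) > 0.98, 1, 0))
    s_int32 = s.astype(pd.SparseDtype('int32', 0))
    s_float32 = s.astype(pd.SparseDtype('float32', 0))

    # Same stored positions, same bytes per value -> same memory footprint
    print(s_int32.memory_usage(deep=True) == s_float32.memory_usage(deep=True))  # True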

    Since your df only contains 0s and 1s, you can just as well pick int8 for storing it.
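    As for your third question: as far as I know, the most direct way to get both minimum width and no truncation is indeed to pass a per-column mapping of SparseDtypes to astype. A sketch using the column names from your df_high (the dtype choices here are just illustrative):

    # Per-column sparse dtypes: int8 is enough for the 0/1 columns,
    # while col1 in df_high needs int64 to hold 100_000_000_000 without truncation.
    sdf_mixed = df_high.astype({
        'col1': pd.SparseDtype('int64', 0),
        'col2': pd.SparseDtype('int8', 0),
        'col3': pd.SparseDtype('int8', 0),
    })
    print(sdf_mixed.dtypes)
    print(f"{sdf_mixed.memory_usage(deep=True).sum() / 1024 ** 2:.1f} MB")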

    Now to your int question: your platform seems to interpret int as int32, while on my platform int is equivalent to int64. NumPy is responsible for this; here is the relevant bit from the NumPy documentation:

    numpy.int [...] numpy.int64, or numpy.int32
    ...
    For np.int a direct replacement with np.int_ or int is also good and will not change behavior, but the precision will continue to depend on the computer and operating system. If you want to be more explicit and review the current use, you have the following alternatives: [...]
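    You can check what a bare int resolves to on your own machine before relying on it; a quick check using only the dtype constructors already shown above:

    import numpy as np
    import pandas as pd

    # Platform/NumPy-dependent: typically int64 on Linux/macOS,
    # int32 on older Windows/NumPy builds.
    print(np.dtype(int))
    print(pd.SparseDtype('int', 0))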

    Because int32 was chosen in your case, you see the values 0 (obviously) and 1215752192. The latter is 100_000_000_000 stored in an int32: the value overflows and wraps around to 100_000_000_000 % (2**32) (run this in Python), which gives 1215752192.

    BTW, here are the relevant parts from my python interpreter:

    >>> df_high['col1'] = df_high['col1'] * 100_000_000_000
    >>> sdf_int32 = df.astype(pd.SparseDtype('int', 0))
    >>> sdf_int32.dtypes
    col1    Sparse[int64, 0]
    col2    Sparse[int64, 0]
    col3    Sparse[int64, 0]
    dtype: object
    >>> sdf_high_int32 = df_high.astype(pd.SparseDtype('int', 0))
    >>> sdf_high_int32.dtypes
    col1    Sparse[int64, 0]
    col2    Sparse[int64, 0]
    col3    Sparse[int64, 0]
    dtype: object
    >>> 100_000_000_000 % (2**32)
    1215752192
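    If you want to reproduce the same wrap-around with NumPy directly, casting an out-of-range value to int32 shows the truncation (a small sketch; astype performs the usual C-style wrap, though newer NumPy versions may emit a warning):

    import numpy as np

    # int32 cannot hold 100_000_000_000; astype wraps it modulo 2**32
    print(np.array([100_000_000_000], dtype=np.int64).astype(np.int32))
    # -> [1215752192]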