I am dealing with Python objects containing pandas DataFrame, pandas Series, and NumPy array objects. These can be large, several million rows each. E.g.:
@dataclass
class MyWorld:
    # A lot of DataFrames with millions of rows
    samples: pd.DataFrame
    addresses: pd.DataFrame
    # etc.
I need to cache these objects, and I am hoping to find an efficient and painless way to serialise them instead of the standard pickle.dump(). Are there any specialised Python serialisers for such objects that would pickle the Series data with an efficient codec and compression automatically? The alternative is to hand-construct several Parquet files, but that requires a lot more manual code, and I'd rather avoid it if possible.
Performance here may mean both the serialisation speed and the size of the resulting file.
I am aware of joblib.dump(), which does some magic for these kinds of objects, but based on the documentation I am not sure whether it is still relevant.
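For context, the joblib route I am referring to would look roughly like this (the file name, data, and compression level below are arbitrary placeholders):

import joblib
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.random.uniform(size=1_000_000)})

# joblib stores large NumPy buffers as dedicated blocks and can compress them;
# compress takes a level or a (codec, level) tuple such as ("zlib", 3).
joblib.dump(df, "cache.joblib", compress=("zlib", 3))
restored = joblib.load("cache.joblib")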
What about storing the huge structures in Parquet format while pickling? This can be automated fairly easily:
import io
from dataclasses import dataclass
import pickle

import numpy as np
import pandas as pd


@dataclass
class MyWorld:
    array: np.ndarray
    series: pd.Series
    frame: pd.DataFrame


@dataclass
class MyWorldParquet:
    array: np.ndarray
    series: pd.Series
    frame: pd.DataFrame

    def __getstate__(self):
        # Replace each field with an in-memory Parquet buffer so that pickle
        # only sees compact, column-compressed bytes. A new state dict is
        # returned so the live object itself is left untouched.
        state = {}
        for key, kind in self.__annotations__.items():
            obj = self.__dict__[key]
            if kind is np.ndarray:
                # Wrap the 1-D array in a single-column frame for Parquet.
                obj = pd.DataFrame({"_": obj})
            if kind is pd.Series:
                obj = obj.to_frame()
            stream = io.BytesIO()
            obj.to_parquet(stream)
            state[key] = stream
        return state

    def __setstate__(self, data):
        self.__dict__.update(data)
        for key, kind in self.__annotations__.items():
            # Read each Parquet buffer back and restore the original type.
            self.__dict__[key] = pd.read_parquet(self.__dict__[key])
            if kind is np.ndarray:
                self.__dict__[key] = self.__dict__[key]["_"].values
            if kind is pd.Series:
                self.__dict__[key] = self.__dict__[key][self.__dict__[key].columns[0]]
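As a quick sanity check (a small sketch with arbitrary sizes, assuming a Parquet engine such as pyarrow is installed), the custom state hooks round-trip through pickle and restore the original types:

small = MyWorldParquet(
    array=np.random.normal(size=10),
    series=pd.Series(np.random.uniform(size=10), name="w"),
    frame=pd.DataFrame({"x": np.random.uniform(size=10)}),
)

restored = pickle.loads(pickle.dumps(small))

assert isinstance(restored.array, np.ndarray)
assert isinstance(restored.series, pd.Series)
assert isinstance(restored.frame, pd.DataFrame)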
Of course, there is a trade-off between speed and size, since reducing the latter requires format translation and compression.
Let's create a toy dataset:
N = 5_000_000

data = {
    "array": np.random.normal(size=N),
    "series": pd.Series(np.random.uniform(size=N), name="w"),
    "frame": pd.DataFrame({
        "c": np.random.choice(["label-1", "label-2", "label-3"], size=N),
        "x": np.random.uniform(size=N),
        "y": np.random.normal(size=N),
    }),
}
We can compare the cost of the Parquet conversion (about 300 ms more):
%timeit -r 10 -n 1 pickle.dumps(MyWorld(**data))
# 1.57 s ± 162 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
%timeit -r 10 -n 1 pickle.dumps(MyWorldParquet(**data))
# 1.9 s ± 71.3 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)
And the size gain (about 40 MiB saved):
len(pickle.dumps(MyWorld(**data))) / 2 ** 20
# 200.28876972198486
len(pickle.dumps(MyWorldParquet(**data))) / 2 ** 20
# 159.13739013671875
Of course, these metrics will strongly depend on the actual dataset being serialized.
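If size matters more than speed, the Parquet codec can also be tuned (not measured above, and codec availability depends on the Parquet engine); to_parquet accepts a compression argument, e.g.:

stream = io.BytesIO()
# Default is "snappy"; "zstd" or "gzip" usually produce smaller buffers
# at the cost of extra CPU time (requires pyarrow support for the codec).
data["frame"].to_parquet(stream, compression="zstd")
print(len(stream.getvalue()) / 2 ** 20)  # size in MiB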