Hi! Are there any ways to load large, (ideally) compressed, columnar-structured data faster into NumPy arrays in Python? Considering common solutions such as Pandas, Apache Parquet/Feather, and HDF5, I am struggling to find a suitable approach for my (time-series) problem.
As expected, representing my data as a NumPy array yields by far the fastest execution time for search problems such as binary search, significantly outperforming the same analysis applied to a Pandas DataFrame. On the other hand, when I store my data as `npz` files, directly loading the `npz` into NumPy arrays takes much longer than loading the same data into a DataFrame using the `fastparquet` engine and the columnar storage of `.parquet`. That route, however, requires me to call `.to_numpy()` on the resulting DataFrame, which again causes heavy delays in accessing the underlying NumPy representation of the data.
As mentioned above, one alternative I tried was to store the data in a format that can be loaded into a NumPy array without any intermediate conversion steps. However, loading appears to be much slower when the data is stored as a `.npz` file (a table with > 10M records and > 10 columns) than when the same data is stored as a `.parquet` file.
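For reference, a minimal sketch of the two loading paths being compared (the file names, archive key, and engine choice are placeholders for my setup):

```python
import numpy as np
import pandas as pd

# Path 1: table saved with np.savez / np.savez_compressed, loaded directly
# into a NumPy array ("data.npz" and the key "table" are placeholders).
with np.load("data.npz") as npz:
    arr_npz = npz["table"]          # ndarray, no conversion step needed

# Path 2: the same table stored as Parquet, read via the fastparquet engine,
# then converted with .to_numpy(), the step that is unexpectedly slow here.
df = pd.read_parquet("data.parquet", engine="fastparquet")
arr_parquet = df.to_numpy()
```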
Actually, fastparquet supports loading your data into a dictionary of NumPy arrays, if you set these up beforehand. This is a "hidden" feature. If you give details of the dtype and size of the data you wish to load, this answer can be edited accordingly.
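The pre-allocation hook is internal and its exact call signature depends on the fastparquet version, so it is not shown here. As a rough sketch using only the documented API, you can at least assemble a dictionary of NumPy arrays column by column (the path and column names below are placeholders):

```python
import fastparquet

pf = fastparquet.ParquetFile("data.parquet")   # placeholder path

# Read each column on its own and keep only the ndarray; this avoids building
# one wide DataFrame, although each column still passes through pandas.
columns = ["timestamp", "price", "volume"]     # placeholder column names
arrays = {
    col: pf.to_pandas(columns=[col])[col].to_numpy()
    for col in columns
}
```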
> to call `.to_numpy()` on the resulting dataframe, which now again causes heavy delays

This is very surprising; it should normally be a copy-free view of the same underlying data.
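One way to check this on your own data is `np.shares_memory`. Whether `.to_numpy()` can stay zero-copy depends on the frame holding a single homogeneous dtype; an illustrative sketch:

```python
import numpy as np
import pandas as pd

# Homogeneous frame: all columns share one float64 block, so to_numpy()
# can usually hand back a view of that block without copying.
homog = pd.DataFrame(np.random.rand(1_000_000, 10))
view = homog.to_numpy()
print(np.shares_memory(view, homog[0].to_numpy()))    # expected True: no copy

# Mixed dtypes cannot live in one block: to_numpy() has to allocate a new
# array and upcast to a common dtype, which is where a delay can come from.
mixed = homog.copy()
mixed["label"] = "x"                                  # adds an object column
copied = mixed.to_numpy()
print(copied.dtype)                                   # object after upcasting
print(np.shares_memory(copied, mixed[0].to_numpy()))  # expected False: copied
```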