I have some problems with pandas' HDFStore
being far to slow and unfortunately I'm unable to put together a satisfying solution from other questions here.
I have a big DataFrame, containing mostly floats and sometimes integer columns which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30min). Each row has a timestamp associated to it. I would like to save some middle steps to a HDF file, so that the user can do a single step iteratively without starting from scratch each time.
Additionally the user should be able to plot certain column from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore. Concretely the user should get a list of all columns of all dataframes stored in the HDF then they should select which columns they would like to see whereafter I use matplotlib to present them the corresponding data.
Datashape == (5730000, 339)
does not seem large at all, that's why I'm confused... (Might get far more rows over time, columns should stay fixed)
In the first step I append iteratively rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.
hdf.put('/name', df, format='fixed')
since hdf.put('/name'.format(grp), df, format='table', data_columns=True)
seemed to be far too slow.df.groupby(df.index).first()
and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict)
to process the data, where agg_dict is a dictonary with one function per column. This is incredibly slow as well.hdfstore.get('/name').columns
groupby
(df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict)
)How to access single columns using .select
I see that I can use this to retrieve only certain columns but only after I know the column names, I think.
Thank you for any advice!
For a HDFStore hdf
and a key
(from hdf.keys()
) you can get the column names with:
# Table stored with hdf.put(..., format='table')
columns = hdf.get_node('{}/table'.format(key)).description._v_names
# Table stored with hdf.put(..., format='fixed')
columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
note that hdf.get(key).columns
works as well, but it reads all the data into memory, while the approach above only reads the column names.
Full working example:
#!/usr/bin/env python
import pandas as pd
data = pd.DataFrame({'a': [1,1,1,2,3,4,5], 'b': [2,3,4,1,3,2,1]})
with pd.HDFStore(path='store.h5', mode='a') as hdf:
hdf.put('/DATA/fixed_store', data, format='fixed')
hdf.put('/DATA/table_store', data, format='table', data_columns=True)
for key in hdf.keys():
try:
# column names of table store
print(hdf.get_node('{}/table'.format(key)).description._v_names)
except AttributeError:
try:
# column names of fixed store
print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
except AttributeError:
# e.g. a dataset created by h5py instead of pandas.
print('unknown node in HDF.')