pythonpandasdataframehdfstore

HDFStore get column names


I have some problems with pandas' HDFStore being far to slow and unfortunately I'm unable to put together a satisfying solution from other questions here.

Situation

I have a big DataFrame, containing mostly floats and sometimes integer columns which goes through multiple processing steps (renaming, removing bad entries, aggregating by 30min). Each row has a timestamp associated to it. I would like to save some middle steps to a HDF file, so that the user can do a single step iteratively without starting from scratch each time.

Additionally the user should be able to plot certain column from these saves in order to select bad data. Therefore I would like to retrieve only the column names without reading the data in the HDFStore. Concretely the user should get a list of all columns of all dataframes stored in the HDF then they should select which columns they would like to see whereafter I use matplotlib to present them the corresponding data.

Data

shape == (5730000, 339) does not seem large at all, that's why I'm confused... (Might get far more rows over time, columns should stay fixed) In the first step I append iteratively rows and columns (that runs okay), but once that's done I always process the entire DataFrame at once, only grouping or removing data.

My approach

  1. I do all manipulations in memory since pandas seems to be rather fast and I/O is slower (HDF is on different physical server, I think)
  2. I use datetime index and automatically selected float or integer columns
  3. I save the steps with hdf.put('/name', df, format='fixed') since hdf.put('/name'.format(grp), df, format='table', data_columns=True) seemed to be far too slow.
  4. I use e.g. df.groupby(df.index).first() and df.groupby(pd.Grouper(freq='30Min')).agg(agg_dict) to process the data, where agg_dict is a dictonary with one function per column. This is incredibly slow as well.
  5. For plotting, I have to read-in the entire dataframe and then get the columns: hdfstore.get('/name').columns

Question

similar questions

How to access single columns using .select I see that I can use this to retrieve only certain columns but only after I know the column names, I think.

Thank you for any advice!


Solution

  • For a HDFStore hdf and a key (from hdf.keys()) you can get the column names with:

    # Table stored with hdf.put(..., format='table')
    columns = hdf.get_node('{}/table'.format(key)).description._v_names
    
    # Table stored with hdf.put(..., format='fixed')
    columns = list(hdf.get_node('{}/axis0'.format(key)).read().astype(str))
    

    note that hdf.get(key).columns works as well, but it reads all the data into memory, while the approach above only reads the column names.


    Full working example:

    #!/usr/bin/env python
    import pandas as pd
    
    data = pd.DataFrame({'a': [1,1,1,2,3,4,5], 'b': [2,3,4,1,3,2,1]})
    
    with pd.HDFStore(path='store.h5', mode='a') as hdf:
        hdf.put('/DATA/fixed_store', data, format='fixed')
        hdf.put('/DATA/table_store', data, format='table', data_columns=True)
        for key in hdf.keys():
            try:
                # column names of table store
                print(hdf.get_node('{}/table'.format(key)).description._v_names)
            except AttributeError:
                try:
                    # column names of fixed store
                    print(list(hdf.get_node('{}/axis0'.format(key)).read().astype(str)))
                except AttributeError:
                    # e.g. a dataset created by h5py instead of pandas.
                    print('unknown node in HDF.')