Tags: python, pandas, hdfstore

How to decrease size overhead of HDFStore?


I am experimenting with different pandas-friendly storage schemes for tick data. The fastest (in terms of reading and writing) so far has been using an HDFStore with blosc compression and the "fixed" format.

import pandas as pd

store = pd.HDFStore(path, complevel=9, complib='blosc')
store.put(symbol, df)  # "fixed" format by default
store.close()

I'm indexing by ticker symbol since that is my common access pattern. However, this scheme adds about 1 MB of space per symbol. That is, if the data frame for a microcap stock contains just a thousand ticks for that day, the file will increase by a megabyte in size. So for a large universe of small stocks, the .h5 file quickly becomes unwieldy.
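For reference, the per-symbol overhead can be measured by comparing file size before and after a single put. This is just a sketch of that measurement, not part of the original setup; bytes_added is an illustrative helper, and path, symbol, and df follow the snippet above.

import os
import pandas as pd

def bytes_added(path, symbol, df):
    # Compare file size before and after adding one node to
    # estimate the per-symbol overhead (illustrative helper).
    before = os.path.getsize(path) if os.path.exists(path) else 0
    with pd.HDFStore(path, complevel=9, complib='blosc') as store:
        store.put(symbol, df)
    return os.path.getsize(path) - before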

Is there a way to keep the performance benefits of blosc/fixed format but get the size down? I have tried the "table" format, which requires about 285 KB per symbol.

store.append(symbol, df, data_columns=True)

However, this format is dramatically slower to read and write.
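To quantify the difference, one can time a write/read round trip for each format. The following is an illustrative sketch (time_roundtrip is an assumed helper name, and path, symbol, and df follow the snippets above), not a benchmark from the original post.

import time
import pandas as pd

def time_roundtrip(path, fmt, symbol, df):
    # Time one write and one read of a single node;
    # fmt is either 'fixed' or 'table'.
    t0 = time.perf_counter()
    with pd.HDFStore(path, complevel=9, complib='blosc') as store:
        store.put(symbol, df, format=fmt)
    t1 = time.perf_counter()
    with pd.HDFStore(path) as store:
        _ = store[symbol]
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1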

In case it helps, here is what my data frame looks like:

exchtime     datetime64[ns]
localtime    datetime64[ns]
symbol               object
country               int64
exch                 object
currency              int64
indicator             int64
bid                 float64
bidsize               int64
bidexch              object
ask                 float64
asksize               int64
askexch              object
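For anyone reproducing this, a tiny frame with the same dtypes can be built as follows; the values are invented, and only the column names and dtypes match the listing above.

import pandas as pd

# Two made-up rows that reproduce the dtypes listed above.
df = pd.DataFrame({
    'exchtime':  pd.to_datetime(['2013-01-02 09:30:00', '2013-01-02 09:30:01']),
    'localtime': pd.to_datetime(['2013-01-02 09:30:00', '2013-01-02 09:30:01']),
    'symbol':    ['ABCD', 'ABCD'],
    'country':   [1, 1],
    'exch':      ['Q', 'Q'],
    'currency':  [840, 840],
    'indicator': [0, 0],
    'bid':       [10.01, 10.02],
    'bidsize':   [100, 200],
    'bidexch':   ['Q', 'N'],
    'ask':       [10.03, 10.04],
    'asksize':   [300, 100],
    'askexch':   ['N', 'Q'],
})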

The blosc compression itself works pretty well since the resulting .h5 file requires only 30--35 bytes per row. So right now my main concern is decreasing the size penalty per node in HDFStore.
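That per-row figure can be checked directly once the file is written; in the sketch below, total_rows is an assumed running count of rows across all symbols, kept by the caller while writing.

import os

# Compressed bytes per row across the whole file
# (total_rows is assumed to be tracked while writing).
bytes_per_row = os.path.getsize(path) / total_rows
print(f'{bytes_per_row:.1f} bytes per row')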


Solution

  • AFAIK there is a certain minimum block size in PyTables.

    Here are some suggestions: