pythonpython-2.7pandashdfstore

Pandas HDFStore: difference between using the select function and direct access


Given a pandas HDFStore containing a DataFrame:

import pandas as pd
import numpy.random as rd

df = pd.DataFrame(rd.randn(int(1000)).reshape(500, 2), columns=list('ab'))
store = pd.HDFStore('store.h5')
store.append('df', df, data_columns=['a', 'b'])

I can use the select function to retrieve a subset of the data, like so:

store.select('df', ['a > 0', 'b > 0'])

However, I can get the same output by falling back to the kind of command I might use if I were using a plain DataFrame that was not in an HDFStore:

store.df[(store.df.a > 0) & (store.df.b > 0)]

Is there a difference between these two approaches? If so, what is the difference?


Solution

  • If you run some benchmark, you'll find the following

    %timeit store.select('df', ['a > 0', 'b > 0'])
    100 loops, best of 3: 2.63 ms per loop
    %timeit store.df[(store.df.a > 0) & (store.df.b > 0)]
    100 loops, best of 3: 6.01 ms per loop
    

    This suggest that the first select access the file fewer times than the second method. Specifically each of the following

    %timeit store.df.a > 0
    100 loops, best of 3: 1.84 ms per loop
    %timeit store.df.b > 0
    1000 loops, best of 3: 1.82 ms per loop
    

    take about 2ms, then you need to select according to a logical AND of both. Only, then you need to apply the final filter. The select instead, access the data only once!