Tags: python, pandas, pytables, hdf

Pandas HDFStore caching


I am working with a medium-sized dataset that consists of around 150 HDF files, 0.5 GB each. A scheduled process updates those files using store.append from pd.HDFStore.
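For reference, the writer side looks roughly like this (a minimal sketch; the file name, the key "data", and the schema are placeholders, not my actual setup):

```python
import pandas as pd

# Illustrative rows to append; the real process writes its own schema.
new_rows = pd.DataFrame(
    {"value": [1.0, 2.0]},
    index=pd.date_range("2024-01-01", periods=2, freq="s"),
)

# Open the store for appending and add the rows to a table-format node.
with pd.HDFStore("store_001.h5", mode="a") as store:
    store.append("data", new_rows, format="table")
```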

I am trying to achieve the following scenario. For each HDF file:

  1. Keep the process that updates the store running
  2. Open a store in a read-only mode
  3. Run a while loop that continuously selects the latest available row from the store (see the sketch after this list)
  4. Close the store on script exit
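
In code, the reader side looks roughly like this (a sketch, assuming the same placeholder key "data" and a 1-second polling interval):

```python
import time
import pandas as pd

# Step 2: open the store in read-only mode.
store = pd.HDFStore("store_001.h5", mode="r")
try:
    # Step 3: poll for the latest row.
    while True:
        # Number of rows in the table node (table format only).
        nrows = store.get_storer("data").nrows
        # Select only the last row by integer position.
        latest = store.select("data", start=nrows - 1, stop=nrows)
        print(latest)
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    # Step 4: close the store on exit.
    store.close()
```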

Now, this works fine, because we can have as many readers as we want, as long as all of them are in read-only mode. However, in step 3, because HDFStore caches the file, it does not return the rows that were appended after the connection was opened. Is there a way to select the newly added rows without re-opening the store?


Solution

  • After doing more research, I concluded that this is not possible with HDF files. The only reliable way of achieving the functionality above is to use a database (SQLite comes closest: its read/write speed is lower than HDF's, but still higher than that of a fully-fledged database like Postgres or MySQL).
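
A minimal sketch of what the reader side could look like with SQLite (the database file, the table name "data", and the ordering by rowid are illustrative, not a prescribed schema):

```python
import sqlite3
import pandas as pd

# Connect to the database that the writer process appends to.
conn = sqlite3.connect("store.db")

# Fetch the most recently inserted row; unlike an open HDFStore,
# each query sees rows committed after this connection was opened.
latest = pd.read_sql_query(
    "SELECT * FROM data ORDER BY rowid DESC LIMIT 1", conn
)
print(latest)

conn.close()
```

With SQLite's WAL journal mode enabled on the database, a single writer and multiple readers can operate concurrently, and every poll returns the newest committed row, which is exactly the behaviour the HDF setup lacks.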