I have an HDF5 file which contains three 1D arrays in separate datasets. This file is created using h5py in Python, and the 1D arrays are continually being appended to (i.e. growing). For simplicity, let’s call these 1D arrays “A”, “B” and “C”, and let’s say each array initially contains 100 values, but every second it grows by one value (e.g. 101, 102, etc.).
What I’m looking to do is create a single virtual dataset which is the concatenation of all three 1D arrays. This is relatively easy for the static case (3 x 100 values), but I want this virtual dataset to grow as more values are added (e.g. 303 values at 1 second, 306 at 2 seconds, etc.).
Is there a pythonic / efficient way to do this which isn’t just deleting the virtual dataset and recreating it each second?
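For reference, here is a minimal sketch of the kind of writer loop I mean (the file name, dataset names and values are just illustrative):

import time

import h5py
import numpy as np

with h5py.File("data.h5", "a") as h5f:
    for name in ("A", "B", "C"):
        if name not in h5f:
            # resizeable 1D dataset, seeded with 100 values
            h5f.create_dataset(name, data=np.arange(100), maxshape=(None,))
    for _ in range(5):  # runs indefinitely in the real application
        for name in ("A", "B", "C"):
            dset = h5f[name]
            n = dset.shape[0]
            dset.resize((n + 1,))
            dset[n] = n  # append one new value
        h5f.flush()
        time.sleep(1)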
You don't have to delete the virtual dataset and recreate it when you add data. You can avoid this by using resizeable datasets and a resizeable VDS VirtualLayout (e.g. using the maxshape= parameter). In addition, use the h5py.h5s.UNLIMITED value to create an unlimited selection along an axis of the data source and VDS layout. Both are described in the h5py docs:

- the h5py.VirtualSource() docs describe maxshape usage
- "Creating virtual datasets" describes UNLIMITED under a Note

The solution posted below will accomplish this task.
However, a word of warning before you implement it: HDF5/h5py I/O performance degrades when you write many small blocks of data, so your example may be painfully slow. It's better to occasionally append large blocks of data than to frequently append small ones. (E.g. it's better to add 60*60 values every hour than to add 1 value every second.)
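If the data arrives once a second but you control when it is written, one way to soften this (a sketch only; the block size, file name and dataset name are assumptions, not from the question) is to buffer incoming values in memory and resize/write in larger blocks:

import h5py
import numpy as np

BLOCK = 3600  # e.g. flush once per hour of 1 Hz samples

def append_block(dset, values):
    # Resize a 1D resizeable dataset and append a block of values
    n = dset.shape[0]
    dset.resize((n + len(values),))
    dset[n:] = values

with h5py.File("buffered.h5", "w") as h5f:
    dset = h5f.create_dataset("A", shape=(0,), maxshape=(None,), dtype="int64")
    buffer = []
    for value in range(10 * BLOCK):  # stand-in for the real data stream
        buffer.append(value)
        if len(buffer) >= BLOCK:
            append_block(dset, np.asarray(buffer))
            buffer.clear()
    if buffer:  # flush any leftover values at shutdown
        append_block(dset, np.asarray(buffer))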
Here is a solution that creates 3 resizeable datasets and a resizeable VirtualLayout. The UNLIMITED value is used in the slice definitions that map each VirtualSource to the layout.
import h5py
import numpy as np

num_dsets = 3
a0 = 100
UNLIMITED = h5py.h5s.UNLIMITED

with h5py.File("SO_78415089.h5", "w") as h5f:
    dset_names = [f'dset_{i:02d}' for i in range(num_dsets)]
    # Create virtual layout (one row per source dataset, unlimited columns)
    vds_layout = h5py.VirtualLayout(shape=(num_dsets, a0), maxshape=(num_dsets, None), dtype="int")
    for i, dset_name in enumerate(dset_names):
        # create data and load to a resizeable dataset
        arr_data = np.arange(a0*i, a0*(i+1))
        h5f.create_dataset(dset_name, data=arr_data, maxshape=(None,))
        # Create virtual source and map it to the layout with an unlimited selection
        vsource = h5py.VirtualSource(h5f[dset_name])
        vds_layout[i, :UNLIMITED] = vsource[:UNLIMITED]
    # Add virtual layout to the virtual dataset
    h5f.create_virtual_dataset("vdata", vds_layout, fillvalue=-1)

    # resize datasets and append more values
    for i, dset_name in enumerate(dset_names):
        c0 = h5f[dset_name].shape[0]
        h5f[dset_name].resize((c0 + a0,))
        arr_data = np.arange(c0 + a0*i, c0 + a0*(i+1))
        h5f[dset_name][c0:c0 + a0] = arr_data
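As a quick sanity check (reusing the file and dataset names from the code above), you can reopen the file and confirm the virtual dataset has grown along with its sources; any positions not backed by source data would show the fillvalue (-1):

with h5py.File("SO_78415089.h5", "r") as h5f:
    vdata = h5f["vdata"]
    print(vdata.shape)    # expect (3, 200) after the appends above
    print(vdata[0, :5])   # start of dset_00: [0 1 2 3 4]
    print(vdata[2, -5:])  # end of dset_02: [395 396 397 398 399]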