I need to share a large dataset from an HDF5 file between multiple processes, and for a number of reasons mmap is not an option.
So I read it into a NumPy array and then copy that array into shared memory, like this:
import numpy as np
import h5py
from multiprocessing import shared_memory
dataset = h5py.File(args.input)['data']
shm = shared_memory.SharedMemory(
    name=memory_label,
    create=True,
    size=dataset.nbytes
)
shared_tracemap = np.ndarray(dataset.shape, buffer=shm.buf)
shared_tracemap[:] = dataset[:]
But this approach temporarily doubles the required memory, because dataset[:] materializes the data in a temporary array before it is copied into shared memory. Is there a way to read the dataset directly into SharedMemory?
First, an observation: in your code, dataset is an h5py Dataset object, not a NumPy array, so it does not load the entire dataset into memory.
As @Monday commented, read_direct() reads directly from an HDF5 dataset into an existing NumPy array. Use it to avoid the intermediate copy that slicing creates.
Here is how to add it to your code. (Note: I suggest including the dtype keyword in your np.ndarray() call.)
shared_tracemap = np.ndarray(dataset.shape, dtype=dataset.dtype, buffer=shm.buf)
dataset.read_direct(shared_tracemap)
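To round this out, here is a minimal, self-contained sketch of the whole pattern, including a consumer attaching to the segment and the cleanup calls. The file name data.h5 and the segment name tracemap are placeholders; the consumer half would normally run in a separate process that receives the shape and dtype through whatever channel you use to coordinate the processes.

import numpy as np
import h5py
from multiprocessing import shared_memory

# producer: read the HDF5 dataset straight into a shared-memory buffer
with h5py.File("data.h5", "r") as f:                  # placeholder file name
    dataset = f["data"]
    shm = shared_memory.SharedMemory(name="tracemap", create=True, size=dataset.nbytes)
    shared_tracemap = np.ndarray(dataset.shape, dtype=dataset.dtype, buffer=shm.buf)
    dataset.read_direct(shared_tracemap)              # no intermediate NumPy copy
    shape, dtype = dataset.shape, dataset.dtype       # consumers need these to rebuild the view

# consumer (normally a separate process): attach to the existing segment
existing = shared_memory.SharedMemory(name="tracemap")
view = np.ndarray(shape, dtype=dtype, buffer=existing.buf)
# ... work with view ...
del view                                              # drop the array before closing the buffer
existing.close()

# creator releases the segment once every process is done with it
del shared_tracemap
shm.close()
shm.unlink()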
You can use the source_sel= and dest_sel= keywords to read a slice from the dataset. Example:
dataset.read_direct(shared_tracemap, source_sel=np.s_[0:100], dest_sel=np.s_[0:100])
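The same keywords also let you populate the shared array incrementally, which is handy if you only need part of the dataset or want to start working on it before the whole copy finishes. A small sketch, reusing dataset and shared_tracemap from above and an arbitrarily chosen block length of 1000 rows:

# copy the dataset into the shared array block by block along the first axis
block = 1000                                  # arbitrary block length
for start in range(0, dataset.shape[0], block):
    stop = min(start + block, dataset.shape[0])
    dataset.read_direct(shared_tracemap,
                        source_sel=np.s_[start:stop],
                        dest_sel=np.s_[start:stop])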