I'm using Cython to read a single Dataset from an HDF5 file using 64 threads. Each thread calculates a start index start and a chunk size size, and reads from that chunk into a common buffer buf, which is a memoryview of a NumPy array. Crucially, each thread opens its own copy of the file and Dataset. Here's the code:
def read_hdf5_dataset(const char* file_name, const char* dataset_name,
                      long[::1] buf, int num_threads):
    cdef hsize_t base_size = buf.shape[0] // num_threads
    cdef hsize_t start, size
    cdef hid_t file_id, dataset_id, mem_space_id, file_space_id
    cdef int thread
    for thread in prange(num_threads, nogil=True):
        start = base_size * thread
        # the last thread also takes the remainder of the division
        size = base_size + buf.shape[0] % num_threads \
            if thread == num_threads - 1 else base_size
        # each thread opens its own file and dataset handles
        file_id = H5Fopen(file_name, H5F_ACC_RDONLY, H5P_DEFAULT)
        dataset_id = H5Dopen2(file_id, dataset_name, H5P_DEFAULT)
        # memory dataspace for this thread's chunk, file dataspace selecting it
        mem_space_id = H5Screate_simple(1, &size, NULL)
        file_space_id = H5Dget_space(dataset_id)
        H5Sselect_hyperslab(file_space_id, H5S_SELECT_SET, &start,
                            NULL, &size, NULL)
        H5Dread(dataset_id, H5Dget_type(dataset_id), mem_space_id,
                file_space_id, H5P_DEFAULT, <void*> &buf[start])
        H5Sclose(file_space_id)
        H5Sclose(mem_space_id)
        H5Dclose(dataset_id)
        H5Fclose(file_id)
Although it reads the Dataset correctly, total CPU utilization maxes out at exactly 100% (i.e. a single core) on a float32 Dataset of ~10 billion entries, even though the same code uses all 64 CPUs (albeit only at ~20-30% utilization each, due to the I/O bottleneck) on a float32 Dataset of ~100 million entries. I've tried this on two different computing clusters with the same result. Maybe it has something to do with the size of the Dataset being greater than INT32_MAX?
What's stopping this code from running in parallel on extremely large datasets, and how can I fix it? Any other suggestions to improve the code's clarity or efficiency would also be appreciated.
Something is happening that is either preventing Cython's prange from launching multiple threads, or preventing the threads from getting anywhere once launched. It may or may not have anything to do directly with HDF5. Here are some possible causes:
Are you pre-allocating a buf large enough to hold the entire dataset before running your function? If so, your program is allocating 40+ gigabytes of memory (~10 billion entries x 4 bytes per float32). How much memory do the nodes you're running on have? Are you the only user? Memory starvation could easily cause the kind of performance issues you describe.
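One quick sanity check on that point: before allocating buf, compare the size you're about to request against the memory actually free on the node. This is only a minimal sketch, assuming the psutil package is installed and using made-up numbers (swap in whatever dtype and length you actually read into):

# pre-flight memory check -- illustrative only, not part of the original program
import numpy as np
import psutil  # assumes psutil is installed on the node

n_entries = 10_000_000_000            # ~10 billion entries, as in the question
dtype = np.float32                    # swap in whatever dtype buf really uses

needed = n_entries * np.dtype(dtype).itemsize
free = psutil.virtual_memory().available
print(f"buffer needs {needed / 2**30:.1f} GiB, node has {free / 2**30:.1f} GiB available")

if needed > 0.9 * free:
    raise MemoryError("buffer would not fit comfortably in RAM; read in smaller pieces")

buf = np.empty(n_entries, dtype=dtype)  # only allocate once the check passes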
Both Cython and HDF5 require certain compilation flags in order to correctly support parallelism. Between your small-dataset and large-dataset runs, did you modify or recompile your code at all?
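For the Cython half of that, the flag that matters for prange is OpenMP: without it the loop compiles fine but runs serially on one thread. Here's a minimal sketch of the relevant bit of a setup.py, with a made-up module name and GCC/Clang-style flags (MSVC uses /openmp); the library and include paths will be site-specific on a cluster:

# setup.py -- illustrative build script; the OpenMP flags are the point here
from setuptools import Extension, setup
from Cython.Build import cythonize

ext = Extension(
    "read_hdf5",                        # hypothetical module name
    sources=["read_hdf5.pyx"],
    libraries=["hdf5"],                 # link against the HDF5 C library
    extra_compile_args=["-fopenmp"],    # lets prange actually spawn threads
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize([ext]))

For the HDF5 half, the library itself has to have been configured with --enable-threadsafe before calling it from multiple threads is supported at all; h5cc -showconfig on your cluster should tell you whether that's the case. Even a thread-safe build serializes API calls behind a global lock, so don't expect the H5Dread calls themselves to scale linearly.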
One easy way to explain why your program is using 100% of a single CPU is that it's getting hung somewhere before your read_hdf5_dataset function is ever called. What other code in your program runs first, and could it be causing the problems you see?
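Whichever of these turns out to apply, it's cheap to confirm, independently of HDF5, whether prange is launching threads at all in your build. Here's a minimal sketch of a throwaway Cython module (hypothetical name and numbers) that records which OpenMP thread ran each iteration; build it with the OpenMP flags above, and if it returns 1 the extension is effectively serial:

# prange_check.pyx -- throwaway diagnostic, hypothetical module name
import numpy as np
from cython.parallel cimport prange, threadid

def distinct_threads(int n_iter=1024, int num_threads=8):
    """Return how many distinct OpenMP threads executed the loop body."""
    ids = np.full(n_iter, -1, dtype=np.intc)
    cdef int[::1] tid = ids
    cdef int i
    for i in prange(n_iter, nogil=True, num_threads=num_threads):
        tid[i] = threadid()   # record the OpenMP thread id for iteration i
    return len(np.unique(ids))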
Part of the problem here is that it's going to be very hard for anyone on this site to reproduce your exact issue, since we don't have most of your program, and I at least don't have any 40 GB HDF5 files lying around (though back in my grad school days, terabytes). If none of the suggestions above helps, I think you have two ways forward: