pythoncythonhdf5hdf

Why does parallel reading of an HDF5 Dataset max out at 100% CPU, but only for large Datasets?


I'm using Cython to read a single Dataset from an HDF5 file using 64 threads. Each thread calculates a start index start and chunk size size, and reads from that chunk into a common buffer buf, which is a memoryview of a NumPy array. Crucially, each thread opens its own copy of the file and Dataset. Here's the code:

def read_hdf5_dataset(const char* file_name, const char* dataset_name,
                      long[::1] buf, int num_threads):
    cdef hsize_t base_size = buf.shape[0] // num_threads
    cdef hsize_t start, size
    cdef hid_t file_id, dataset_id, mem_space_id, file_space_id
    cdef int thread
    for thread in prange(num_threads, nogil=True):
        start = base_size * thread
        size = base_size + buf.shape[0] % num_threads \
            if thread == num_threads - 1 else base_size
        file_id = H5Fopen(file_name, H5F_ACC_RDONLY, H5P_DEFAULT)
        dataset_id = H5Dopen2(file_id, dataset_name, H5F_ACC_RDONLY)
        mem_space_id = H5Screate_simple(1, &size, NULL)
        file_space_id = H5Dget_space(dataset_id)
        H5Sselect_hyperslab(file_space_id, H5S_SELECT_SET, &start,
                            NULL, &size, NULL)
        H5Dread(dataset_id, H5Dget_type(dataset_id), mem_space_id,
                file_space_id, H5P_DEFAULT, <void*> &buf[start])
        H5Sclose(file_space_id)
        H5Sclose(mem_space_id)
        H5Dclose(dataset_id)
        H5Fclose(file_id)

Although it reads the Dataset correctly, the CPU utilization maxes out at exactly 100% on a float32 Dataset of ~10 billion entries, even though it uses all 64 CPUs (albeit only at ~20-30% utilization due to the I/O bottleneck) on a float32 Dataset of ~100 million entries. I've tried this on two different computing clusters with the same result. Maybe it has something to do with the size of the Dataset being greater than INT32_MAX?

What's stopping this code from running in parallel on extremely large datasets, and how can I fix it? Any other suggestions to improve the code's clarity or efficiency would also be appreciated.


Solution

  • Something is happening that is either preventing cython's prange from launching multiple threads, or is preventing the threads from getting anywhere once launched. It may or may not have anything to do directly with hdf5. Here's some possible causes:

    Part of the problem here is that it is going to be very hard for any users on this site to reproduce your exact issue, since we don't have most of your program and I at least don't have any 40 gig hdf5 files lying around (back in my grad school days tho, terabytes). If one of my above suggestions doesn't help, I think you have two ways forward: