[SOLVED] HDF5 files and plotting using chunks

HDF5 files and plotting using chunks

I'm new to HDF5 files and I don't understand how to access chunks in a dataset. I have quite a big dataset (1536, 2048, 11, 18, 2) which is chunked into (768, 1024, 1,1,1), each chunk represents half of an image. I want to plot the dataset, giving the mean values of each (whole) image (using matplotlib).

Question: how to I access separate chunks and how do I work with them? (How does h5py use them?)

This is my code:

bla = np.random.randint(0,100, (1536, 2048, 11, 18, 2))

with h5py.File('test.h5','w') as f:
    grp = f.create_group('Measurement 1')
    grp.create_dataset('data', data = bla, chunks = (768,1024,1,1,1))

f.close()

I have this to get access to the dataset, but I don't know how to access the chunks..

with h5py.File('test.h5', 'r') as hf:
            for dset in hf['Measurement 1'].keys():      
                print (dset)
                ds_hf = hf['Measurement 1']['data'] # returns HDF5 dataset object
                print (ds_hf)
                print (ds_hf.shape, ds_hf.dtype)
                data_f = hf['Measurement 1']['data'][:] # adding [:] returns a numpy array
hf.close()

I need the program to open each chunk, get the mean value and close it again before opening the next one, so my RAM doesn't get full.

Solution

Here is a sample code that you can understand how chunks work in hdf5, I developed it in a general way, you can modify it based on you requirements:

import h5py
import numpy as np

# Generate random data
bla = np.random.randint(0, 100, (1536, 2048, 11, 18, 2))

# Create the HDF5 file and dataset
with h5py.File('test.h5', 'w') as f:
    grp = f.create_group('Measurement 1')
    grp.create_dataset('data', data=bla, chunks=(768, 1024, 1, 1, 1))

# Open the HDF5 file
with h5py.File('test.h5', 'r') as hf:
    # Access the dataset
    ds_hf = hf['Measurement 1']['data']
    print(ds_hf)
    print(ds_hf.shape, ds_hf.dtype)

    # Iterate over the chunks
    for chunk_idx in np.ndindex(ds_hf.chunks):
        chunk = ds_hf[chunk_idx]
        # Process the chunk
        chunk_mean = np.mean(chunk)
        print(f"Chunk {chunk_idx}: Mean value = {chunk_mean}")

# Close the HDF5 file
hf.close()