python deep-learning pytorch computer-vision h5py

Reading .h5 files and using them in a dataset


I have two folders (one for train and one for test), each containing around 10 files in h5 format. I want to read them and use them in a dataset. I have a function that reads a single file, but I don't know how to use it inside my class.

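import h5py
from torch.utils.data import Dataset

def read_h5(path):
    # The context manager closes the file once the arrays are copied out
    with h5py.File(path, 'r') as data:
        image = data['image'][:]
        label = data['label'][:]
    return image, label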

class Myclass(Dataset):
    def __init__(self, split='train', transform=None):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def __getitem__(self, index):
        raise NotImplementedError

Do you have a suggestion? Thank you in advance


Solution

  • This might be a start for what you want to do. I implemented __init__(), but not __len__() or __getitem__(). The user provides the path, and __init__() calls the class method read_h5() to get the image and label arrays. A short main section creates class objects from 2 different H5 files. Modify the paths list with the folder and file names of all your training and testing data.

    import h5py

    class H5_data():
        def __init__(self, path): #split='train', transform=None):
            self.path = path
            # Load both arrays into memory once, at construction time
            self.image, self.label = H5_data.read_h5(path)

        @classmethod
        def read_h5(cls, path):
            # [()] reads the whole dataset into a NumPy array
            with h5py.File(path, 'r') as data:
                image = data['image'][()]
                label = data['label'][()]
                return image, label
            
    paths = ['train_0.h5', 'test_0.h5']
    for path in paths:
        h5_test = H5_data(path)
        print(f'For HDF5 file: {path}')
        print(f'image data, shape: {h5_test.image.shape}; dtype: {h5_test.image.dtype}')
        print(f'label data, shape: {h5_test.label.shape}; dtype: {h5_test.label.dtype}')
    

    IMHO, creating a class that holds the array data is overkill (and could lead to memory problems if you have really large datasets). It is more memory efficient to create h5py dataset objects and read the data only when you need it. Keep in mind that h5py dataset objects are only usable while the file is open. The example below prints the same information as the code above, without creating a class object that holds NumPy arrays.

    import h5py

    paths = ['train_0.h5', 'test_0.h5']
    for path in paths:
        with h5py.File(path, 'r') as data:
            # Dataset handles only: shape and dtype are metadata, no array data is read
            image = data['image']
            label = data['label']
            print(f'For HDF5 file: {path}')
            print(f'image data, shape: {image.shape}; dtype: {image.dtype}')
            print(f'label data, shape: {label.shape}; dtype: {label.dtype}')
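
To finish the Dataset skeleton from the question, one option is to collect the file paths in __init__() and read a single file in each __getitem__() call, so only the requested sample is held in memory. The sketch below is a minimal version under assumptions that are not stated in the question: each .h5 file holds exactly one sample, the files live in <root>/train and <root>/test, and the array dtypes are ones that PyTorch can convert. The names H5FolderDataset and root are hypothetical.

```python
import glob
import os

import h5py
import torch
from torch.utils.data import Dataset


def read_h5(path):
    # Read both arrays fully into memory; the file is closed on exit.
    with h5py.File(path, 'r') as data:
        image = data['image'][()]
        label = data['label'][()]
    return image, label


class H5FolderDataset(Dataset):
    """Map-style dataset; assumes one (image, label) sample per .h5 file."""

    def __init__(self, root, split='train', transform=None):
        # Assumed layout: <root>/train/*.h5 and <root>/test/*.h5
        folder = os.path.join(root, split)
        self.paths = sorted(glob.glob(os.path.join(folder, '*.h5')))
        if not self.paths:
            raise FileNotFoundError(f'no .h5 files found in {folder}')
        self.transform = transform

    def __len__(self):
        # One sample per file, so the length is the number of files.
        return len(self.paths)

    def __getitem__(self, index):
        # The file is opened only when this sample is requested.
        image, label = read_h5(self.paths[index])
        if self.transform is not None:
            image = self.transform(image)
        return torch.as_tensor(image), torch.as_tensor(label)
```

An instance such as H5FolderDataset('data', split='train') (where 'data' is a placeholder folder name) can then be passed to torch.utils.data.DataLoader. Default batching stacks the samples, so it only works if all images share one shape and all labels share one shape. If each file instead stores many samples along the first axis, __len__() and __getitem__() would need to map a global index to a file and a row, which this sketch does not do.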