
h5py error reading virtual dataset into NumPy array

I'm trying to load data from a virtual HDF5 dataset created with h5py, and I'm having trouble loading the data correctly.

Here is an example of my issue:

import h5py
import tools as ut

virtual = h5py.File(ut.params.paths.virtual)

a = virtual['part2/index'][:]


This outputs:


Why is the last element different when I copy the data into a NumPy array (value=[0]) than when I read it directly from the dataset (value=[890176134])?

Am I doing something horribly wrong without realizing it?

Thanks a lot.


  • Yes, you should get the same values whether you read from the virtual dataset directly or from an array created from it. It's hard to diagnose the error without more details about the data.

    I used the h5py example to demonstrate how this should behave. Most of the code builds the HDF5 files. The section at the end compares the output. The code below is modified from the example to create a variable number of source files (set by a0).

    Code to create the 'a0' source files with sample data:

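    import h5py
    import numpy as np

    a0 = 5000  # number of source files
    # create sample data: one row of 100 integers
    data = np.arange(0, 100).reshape(1, 100)
    # Create source files (0.h5 to {a0-1}.h5)
    for n in range(a0):
        with h5py.File(f"{n}.h5", "w") as f:
            row_data = data + n  # offset each file's values by its index
            f.create_dataset("data", data=row_data)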

    Code to define the virtual layout and assemble virtual dataset:

    # Assemble virtual dataset
    layout = h5py.VirtualLayout(shape=(a0, 100), dtype="i4")
    for n in range(a0):
        filename = "{}.h5".format(n)
        vsource = h5py.VirtualSource(filename, "data", shape=(100,))
        layout[n] = vsource
    # Add virtual dataset to output file
    with h5py.File("VDS.h5", "w", libver="latest") as f:
        f.create_virtual_dataset("vdata", layout)

    Code to read and print the data:

    # read data back
    # virtual dataset is transparent for reader!
    with h5py.File("VDS.h5", "r") as f:
        arr = f["vdata"][:]
        print("\nFirst 10 Elements in First Row:")
        print("Virtual dataset:")
        print(f["vdata"][0, :10])
        print("Reading vdata into Array:")
        print(arr[0, :10])
        print("Last 10 Elements of Last Row:")
        print("Virtual dataset:")
        print(f["vdata"][-1, -10:])
        print("Reading vdata into Array:")
        print(arr[-1, -10:])

    Output from code above (w/ a0=5000):

    First 10 Elements in First Row:
    Virtual dataset:
    [0 1 2 3 4 5 6 7 8 9]
    Reading vdata into Array:
    [0 1 2 3 4 5 6 7 8 9]
    Last 10 Elements of Last Row:
    Virtual dataset:
    [5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
    Reading vdata into Array:
    [5089 5090 5091 5092 5093 5094 5095 5096 5097 5098]
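
If the values still differ on your own file, a small helper like the one below can narrow down where. It is only a sketch: `mismatched_rows` and the `block` parameter are names made up for this illustration, not part of h5py. It reads the dataset once in full and once block by block, then reports the row indices where the two reads disagree.

```python
import numpy as np


def mismatched_rows(dset, block=1024):
    """Return indices along axis 0 where a block-wise read differs from a full read.

    `dset` can be an h5py dataset (virtual or not) or any object that
    supports NumPy-style slicing. `block` is an arbitrary block size.
    Note: NaN entries always compare as different.
    """
    full = np.asarray(dset[:])  # one-shot copy into memory
    bad = []
    for start in range(0, full.shape[0], block):
        stop = min(start + block, full.shape[0])
        part = np.asarray(dset[start:stop])  # direct, partial read
        # Flatten trailing axes so 1-D and N-D datasets are handled alike
        diff = (part != full[start:stop]).reshape(stop - start, -1).any(axis=1)
        bad.extend((np.flatnonzero(diff) + start).tolist())
    return bad
```

With the example file above, this would be called as `mismatched_rows(f["vdata"])` inside the `with h5py.File("VDS.h5", "r") as f:` block. An empty list means both reads agree.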