python hdf5 python-xarray h5py

Data loss when saving xarray data to HDF5 format


I am trying to create an HDF5 file to store some generated data. Saving seems to work (I think), but when I read the file back, some of the data I put in appears to be lost. The code for both saving and loading is below.

saving data

import numpy as np
import xarray as xa
import h5py
import string
import random

save_h5py = h5py.File("data.h5", "w")
ids = [''.join(random.choice(string.ascii_uppercase) for _ in range(5)) for _ in range(10)]

for i in ids:
    data = np.random.rand(10)
    my_array = xa.DataArray(data, dims= ["id"], coords = {"id":ids})
    save_h5py.create_dataset(i, data=my_array )
save_h5py.close()

output for one of the DataArrays

<xarray.DataArray (id: 10)>
array([0.50655903, 0.33954833, 0.04186272, 0.16765385, 0.59900345,
       0.58764172, 0.38523892, 0.77926545, 0.61928491, 0.65678961])
Coordinates:
  * id       (id) <U5 'ESBNB' 'LEQDR' 'XVKFK' ... 'SSWBW' 'VMKYK' 'QSHXN'

loading data

file = h5py.File("data.h5", "r")
data = file.get(ids[2])
data_array = data[:]

result of reading

<HDF5 dataset "XVKFK": shape (10,), type "<f2">

array([0.50655903, 0.33954833, 0.04186272, 0.16765385, 0.59900345,
       0.58764172, 0.38523892, 0.77926545, 0.61928491, 0.65678961])

Here is the trouble: how do I get the ids back? I have tried many ways to access this data, with no luck. I thought the ids might not have been saved, so I tried loading the file in hdf5Viewer to check whether they were present. However, for some reason the program reports the file as unreadable.


Solution

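  • To get your code working on my machine, I modified the lines that create the datasets so that they pass the plain NumPy array instead of the DataArray. h5py writes only the array values, so the id coordinate of a DataArray is not stored in the file. See below: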

    for i in ids:
        data = np.random.rand(10)
        save_h5py.create_dataset(i, data=data)
    save_h5py.close()
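
Since `create_dataset` keeps only the numeric values, the coordinate labels have to be written explicitly if they are to survive the round trip. The following is a minimal sketch of one way to do that, not the only one. It assumes a separate string dataset called `id_labels`; that name, the sample ids, and the temporary file path are all hypothetical and chosen only for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical labels and file location, used only for this sketch.
ids = ["ESBNB", "LEQDR", "XVKFK"]
path = os.path.join(tempfile.mkdtemp(), "labelled.h5")

with h5py.File(path, "w") as h5f:
    # One dataset per id, holding plain NumPy values.
    for name in ids:
        h5f.create_dataset(name, data=np.random.rand(len(ids)))
    # Write the labels explicitly as variable-length UTF-8 strings.
    h5f.create_dataset(
        "id_labels",  # assumed name, not part of the original code
        data=np.array(ids, dtype=object),
        dtype=h5py.string_dtype(),
    )

with h5py.File(path, "r") as h5f:
    raw = h5f["id_labels"][()]
    # h5py 3 returns bytes for string data; older releases may return str.
    labels = [s.decode("utf-8") if isinstance(s, bytes) else s for s in raw]
    values = h5f[labels[2]][()]

print(labels)
print(values.shape)
```

The recovered `labels` list can then be passed back as `coords={"id": labels}` when rebuilding an `xarray.DataArray` from `values`.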
    

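    h5py uses Python's dictionary syntax to access HDF5 objects (the key is the object name and the value is the object). Note that these objects are not actual dictionaries. Because each dataset was created with an id as its name, iterating over the file gives you the ids back. The code below shows how this works for your example: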

    with h5py.File('data.h5', 'r') as h5f:
        for ds_name in h5f:
            print(ds_name)
            print(h5f[ds_name][()])
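
To make the "dictionary syntax, but not a dictionary" point concrete, here is a small sketch of the mapping-style operations a file object supports. The two dataset names are borrowed from the sample output in the question, and the temporary path is an assumption made so that the snippet runs on its own.

```python
import os
import tempfile

import h5py
import numpy as np

# Temporary file so the sketch does not touch any real data.
path = os.path.join(tempfile.mkdtemp(), "dict_demo.h5")

with h5py.File(path, "w") as h5f:
    h5f.create_dataset("XVKFK", data=np.arange(10.0))
    h5f.create_dataset("ESBNB", data=np.ones(10))

with h5py.File(path, "r") as h5f:
    names = list(h5f.keys())             # dataset names, like dict keys
    has_xvkfk = "XVKFK" in h5f           # membership test by name
    missing = h5f.get("NOPE")            # .get() returns None if absent
    shapes = {name: ds.shape for name, ds in h5f.items()}
    is_dict = isinstance(h5f, dict)      # mapping-like, yet not a dict

print(sorted(names), has_xvkfk, missing, is_dict)
```

Each value returned by `h5f[name]` is a dataset object that still points into the file, so the actual numbers have to be read (for example with `[()]`) before the file is closed.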
    

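    The example demonstrates two other important points:

    1. Use Python's with/as file context manager to avoid file corruption and locking issues; the file is closed automatically when the block ends, even if an error occurs.
    2. The preferred way to read all elements of a dataset is [()] rather than [:], because [()] also works for scalar datasets.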
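
Both points can be checked with a short sketch. It assumes a throwaway file holding one 1-D dataset and one scalar dataset; the names `vector` and `scalar` are hypothetical. As far as I know, slicing a scalar dataset with `[:]` raises an error, which is why the snippet catches it rather than relying on it.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "access_demo.h5")

with h5py.File(path, "w") as h5f:
    h5f.create_dataset("vector", data=np.arange(5))
    h5f.create_dataset("scalar", data=3.5)

with h5py.File(path, "r") as h5f:
    vector = h5f["vector"][()]   # reads the whole 1-D dataset
    scalar = h5f["scalar"][()]   # also works for a scalar dataset
    try:
        h5f["scalar"][:]         # slicing a scalar dataset is rejected
        slice_ok = True
    except (TypeError, ValueError):
        slice_ok = False

# After the with block the file handle is closed and evaluates as False.
file_closed = not h5f

print(vector, scalar, slice_ok, file_closed)
```

Because the file is closed on leaving the `with` block, any data needed afterwards should be copied into NumPy arrays inside the block, as done here with `vector` and `scalar`.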