Tags: python, file, filesystems, hdf5, hdfstore

How can I track and display continuous changes made to an HDF5 file using Python?


I have a function that appends a new element to a dataset in an HDF5 file every second.

import h5py
import numpy as np
from time import time, sleep

i = 100

def update_array():

    hf = h5py.File('task1.h5', 'r+')
    old_rec = np.array(hf.get('array'))
    global i
    i = i+1
    new_rec = np.append(old_rec, i)

    # delete the old record and replace it with the updated record
    del hf['array']
    new_data = hf.create_dataset('array', data = new_rec)
    print(new_rec)
    
    hf.close()

while True:
    sleep(1 - time() % 1)
    update_array()

The print statement produces the output below. It shows the updated array, but I do not know whether the data is actually being saved to the file:

[101.]
[101. 102.]
[101. 102. 103.]
[101. 102. 103. 104.]
[101. 102. 103. 104. 105.]
[101. 102. 103. 104. 105. 106.]
[101. 102. 103. 104. 105. 106. 107.]
[101. 102. 103. 104. 105. 106. 107. 108.]

I want a separate notebook that tracks the changes made by the function above and displays the updated contents of this dataset in the HDF5 file.

I want a separate function for this task because I want to confirm that the updated content is saved to the HDF5 file, and to perform further on-the-fly operations on the new values as they arrive.


Solution

  • Here is a potential solution that attaches attributes to the 'array' dataset. Adding attributes to an HDF5 object is easy with .attrs, which has a dictionary-like syntax: h5obj.attrs[attr_name] = attr_value. Attribute values can be ints, strings, floats, or arrays. You can add 2 attributes to your dataset with the following 2 lines:

    hf['array'].attrs['Last Value'] = i
    hf['array'].attrs['Time Added'] = ctime(time())
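
To check that attributes really end up on disk, they can be written in one session and read back in another. The sketch below is only an illustration: it uses a hypothetical scratch file (attrs_demo.h5 in a temporary directory) rather than task1.h5, and fixed sample values.

```python
import os
import tempfile
from time import ctime, time

import h5py

# Hypothetical scratch file so the demo does not touch task1.h5.
path = os.path.join(tempfile.mkdtemp(), "attrs_demo.h5")

# Session 1: create a dataset and attach two attributes to it.
with h5py.File(path, "w") as hf:
    dset = hf.create_dataset("array", data=[101, 102, 103])
    dset.attrs["Last Value"] = 103
    dset.attrs["Time Added"] = ctime(time())

# Session 2: reopen read-only and copy the attributes out
# before the file is closed again.
with h5py.File(path, "r") as hf:
    attrs = dict(hf["array"].attrs)

last_value = int(attrs["Last Value"])
print(last_value, attrs["Time Added"])
```

Because the second block opens the file independently, anything it prints must have been saved by the first.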
    

    To demonstrate, I added these lines to your code, along with several other modifications to address the following issues:

    1. I corrected the errors noted in my comments: I added create_array() to create the file and dataset initially, and made the dataset resizable to simplify the logic in update_array().
    2. I modified the update_array() code to enlarge the dataset and append the new value. This is much cleaner (and faster) than your 4-step process.
    3. I used Python's with / as: context manager to open the file. This eliminates the need to close it, and (more importantly) ensures it is closed cleanly if the program exits abnormally.
    4. I removed NumPy functions. There is no need to create an array if you are adding 1 scalar each time.
    5. My print statement shows the preferred way to create a NumPy array from a dataset: use hf['array'][:] instead of np.array(hf.get('array')).
    6. I prefer to open files once (unless there is a compelling reason to open and close them repeatedly), because that eliminates file setup/teardown overhead. I did not do this here. If you want to, move the with / as: lines into the main program and pass the resulting hf object to the create_array() and update_array() functions. If you do that, you can easily consolidate the 2 functions. (You will need logic to test whether the 'array' dataset exists.)
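
Point 6 could look roughly like the sketch below. It is not the code used in this answer: append_value() is a hypothetical helper name, the scratch file name is made up, the int64 dtype is an assumption, and the one-second sleep is omitted so the demo finishes immediately.

```python
import os
import tempfile
from time import ctime, time

import h5py


def append_value(hf, value, name="array"):
    """Append one value to a resizable 1-D dataset, creating it if needed."""
    if name not in hf:
        # First call: create a resizable dataset holding a single value.
        dset = hf.create_dataset(name, shape=(1,), maxshape=(None,), dtype="i8")
        dset[0] = value
    else:
        # Later calls: grow the dataset by one element and fill the new slot.
        dset = hf[name]
        n = dset.shape[0]
        dset.resize(n + 1, axis=0)
        dset[n] = value
    dset.attrs["Last Value"] = value
    dset.attrs["Time Added"] = ctime(time())
    return dset


# Hypothetical scratch file; the file is opened once for all updates.
path = os.path.join(tempfile.mkdtemp(), "single_open_demo.h5")
with h5py.File(path, "w") as hf:
    for value in range(100, 104):
        append_value(hf, value)
        hf.flush()  # push buffered data to disk without closing the file
    result = hf["array"][:].tolist()

print(result)
```

The trade-off is that the file stays open in write mode for the whole run, so a separate reader may be unable to open it until the writer finishes, depending on HDF5 file locking and whether SWMR mode is used.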

    Code below:

    import h5py
    from time import time, sleep, ctime
    
    def create_array():
    
        with h5py.File('task1.h5', 'w') as hf:
            global i 
    
            #create dataset and add new record
            new_data = hf.create_dataset('array', shape=(1,), maxshape=(None,),
                                          data = [i])
            # add attributes
            hf['array'].attrs['Last Value'] = i
            hf['array'].attrs['Time Added'] = ctime(time())
    
            print(hf['array'][:])
    
    def update_array():
    
        with h5py.File('task1.h5', 'r+') as hf:
            global i 
            i += 1
          
            #resize dataset and add new record
            a0 = hf['array'].shape[0]
            hf['array'].resize(a0+1,axis=0)
            hf['array'][a0] = i
            
            # add attributes
            hf['array'].attrs['Last Value'] = i
            hf['array'].attrs['Time Added'] = ctime(time())
            
            print(hf['array'][:])
        
    i = 100
    create_array()
    
    while i < 110:
        sleep(1 - time() % 1)
        update_array()
    
    print('Done')
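
For the separate notebook, one simple approach is to poll the file and report only the values that are new since the last poll. The sketch below is a minimal illustration under stated assumptions: read_snapshot() and watch() are hypothetical helper names, the writer closes the file between updates (as the code above does), and a failed open is treated as "try again later". The demo at the bottom uses a made-up scratch file with fixed contents instead of a live writer.

```python
import os
import tempfile
from time import sleep

import h5py


def read_snapshot(path, name="array"):
    """Return (values, attributes), or None if the file is missing, busy, or not ready."""
    try:
        with h5py.File(path, "r") as hf:
            if name not in hf:
                return None
            dset = hf[name]
            return dset[:], dict(dset.attrs)
    except OSError:
        # Raised if the file does not exist yet or cannot be opened right now.
        return None


def watch(path, polls, interval=1.0):
    """Poll the file `polls` times and yield only the newly appended values."""
    seen = 0
    for _ in range(polls):
        snapshot = read_snapshot(path)
        if snapshot is not None:
            values, attrs = snapshot
            if len(values) > seen:
                yield values[seen:], attrs
                seen = len(values)
        sleep(interval)


# Demo on a hypothetical scratch file standing in for task1.h5.
path = os.path.join(tempfile.mkdtemp(), "watch_demo.h5")
with h5py.File(path, "w") as hf:
    hf.create_dataset("array", data=[100, 101, 102], maxshape=(None,))
    hf["array"].attrs["Last Value"] = 102

updates = [(vals.tolist(), int(attrs["Last Value"]))
           for vals, attrs in watch(path, polls=2, interval=0)]
print(updates)
```

In the notebook, something like `for new_values, attrs in watch('task1.h5', polls=60): print(new_values, attrs['Time Added'])` would run alongside the writer. If the writer must keep the file open, HDF5's single-writer/multiple-reader (SWMR) mode, which h5py exposes, is worth looking at instead of plain polling.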