What I am trying to do is take an existing HDF5 hierarchical dataset, select some of the elements, do an operation on them, and create a transformed file.
To read the file and do the operations, I am using code similar to this:
import h5py

with h5py.File(path, 'r') as fDFT:  # Opens the dataset and closes the file after the loop
    mol_ids = list(fDFT.keys())  # Main identifiers of the data
    for molid in mol_ids:  # Iterate over each molecule
        conf_ids = list(fDFT[molid].keys())
        for confid in conf_ids:  # Iterate over the configurations
            if 'opt' in str(confid):  # Only select specific configurations
                res1 = fDFT[molid][confid]['mPOL'][:]  # Read an entry from the file
                res2 = convert(fDFT[molid][confid]['atNUM'][:])  # Transform another entry
I can create the output HDF5 file and write the first identifier easily, since mol_ids is a saved variable, by adding
hdf5_output = h5py.File('restricted.hdf5', 'w')
hdf5_output['molid'] = mol_ids
but this creates a single entry labelled 'molid', which is not what I want (the labels should be the values within mol_ids), and I am not sure how to extend this approach to the nested levels anyway.
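To make it concrete, this is roughly the layout those two lines produce versus the one I am after (group/dataset names invented for illustration):

# What the two lines above produce:
# /molid                      <- one entry named 'molid' holding all the id strings
#
# What I want instead:
# /mol001/conf_opt1/mPOL
# /mol002/conf_opt1/mPOL
# ...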
Is there a nice way to get an HDF5 file that follows the hierarchy of the for loop, that is, one that creates a data structure that can be accessed as hdf5_output[molid][confid]['mPOL'], for example?
I've tried the create_group() function, but it did not allow me to create groups iteratively.
I think you are confusing the string 'molid' with the variable name molid. Also, you need to create a group with that name. What you likely want is hdf5_output.create_group(molid) (group name without quotes).
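A minimal contrast of the two, reusing your names (the value of molid is invented for illustration):

hdf5_output['molid'] = mol_ids    # creates ONE dataset literally named 'molid' holding all the ids
hdf5_output.create_group(molid)   # creates a group named by the VALUE of molid, e.g. 'mol001'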
As I understand your 2nd question, it would look something like this:
# Opens the source file for reading and
# creates the restricted file for writing.
# Both are closed when exiting the with/as context manager.
with h5py.File(path, 'r') as fDFT, \
     h5py.File('restricted.hdf5', 'w') as hdf5_output:
    for molid in fDFT:  # Iterate over root-level objects
        # Create a group named molid in the 2nd H5 file:
        hdf5_output.create_group(molid)
        for confid in fDFT[molid]:  # Iterate over the configuration groups
            if 'opt' in confid:  # Select configurations with 'opt' in the name
                res1 = fDFT[molid][confid]['mPOL'][()]  # Read an entry from the file
                res2 = convert(fDFT[molid][confid]['atNUM'][()])  # Transform another entry
                # Create the nested confid group before writing into it:
                hdf5_output[molid].create_group(confid)
                # The following code is my guess at the 2nd part (writing res1/res2 data to the new file).
                # Option 1 uses the h5py copy method to copy the res1 dataset - won't work for res2.
                # It is the easiest/preferred way to copy datasets as-is:
                # fDFT.copy(fDFT[molid][confid]['mPOL'], hdf5_output[molid][confid])
                # Option 2 writes your res1 and res2 arrays to the new file:
                hdf5_output[molid][confid]['mPOL'] = res1
                hdf5_output[molid][confid]['atNUM_converted'] = res2
                # Alternately, you could skip res1/res2 creation and write directly:
                # hdf5_output[molid][confid]['mPOL'] = fDFT[molid][confid]['mPOL'][()]
                # hdf5_output[molid][confid]['atNUM_converted'] = convert(fDFT[molid][confid]['atNUM'][()])
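As a related option (this is standard h5py behavior, not something from your code): create_group() accepts a path and creates the intermediate groups automatically, and require_group() opens a group if it exists or creates it otherwise, which is convenient inside loops. A minimal sketch with invented names:

import h5py

with h5py.File('restricted.hdf5', 'w') as hdf5_output:
    # A path creates all intermediate groups in one call:
    grp = hdf5_output.create_group('mol001/conf_opt1')
    # require_group() won't raise if the group already exists:
    grp = hdf5_output.require_group('mol001/conf_opt1')
    grp['mPOL'] = [1.0, 2.0, 3.0]  # invented data, just to show a write into the nested group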
Notes:
- Both files are opened in a single with/as context manager, so they are both closed automatically on exit.
- You don't need .keys() to get the object names. Similar to dictionary syntax, a loop over an H5 object yields the keys (group and/or dataset names). Also, you don't need to convert them to a list.
- The group.copy() method is the simplest way to copy an entire dataset as-is. By default it uses the source dataset's compression and chunking parameters and copies attributes. I included an example for the 'mPOL' dataset. Complete description here.
- Use ds_name[()] instead of ds_name[:]. Ref: h5py Reading & writing data
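To illustrate the last point, a small sketch of the reading syntax (the paths reuse the invented names from the sketch above):

import h5py

with h5py.File('restricted.hdf5', 'r') as f:
    ds = f['mol001/conf_opt1/mPOL']  # an h5py Dataset object; no data is read yet
    arr = ds[()]    # reads the entire dataset into a NumPy array (also works for scalar datasets)
    arr2 = ds[:]    # NumPy-style slice; equivalent for arrays, but raises on scalar datasets
    part = ds[0:2]  # slicing can also read just part of a dataset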