rust hdf5 hdf

How to read an HDF5 compound datatype with unknown composition?


I have managed to read HDF5 data into Rust using the hdf5 crate (shown below), but this method relies on knowing the exact structure of the dataset beforehand.

In the data I am trying to read, the column header names stay consistent, but their order can vary and some columns may be missing. Currently all columns are f64, which simplifies the problem.

Ideally, for the processing I am doing, the data needs to end up in a vector of hashmaps, Vec<HashMap<String, f64>>, where the key is the column heading. The inverted form, HashMap<String, Vec<f64>>, would also work.

Even reading each record of the dataset as raw bytes and transforming it myself using the descriptor would work. I just can't find a way of doing this with the library. Is there a way to do this without resorting to the C HDF5 library?

    use hdf5::*;
    #[derive(H5Type, Clone, PartialEq, Debug)] // register with HDF5
    #[repr(C)]
    struct Wrap {
        time: f64,
        px: f64,
        py: f64,
        pz: f64,
        r: f64,
        u: f64,
    }

    let file = File::open("file_1.hdf")?;
    let ds = file.dataset("object")?;
    let data_type = ds.dtype()?;
    let descriptor = data_type.to_descriptor()?;

    let data: Vec<Wrap> = ds.read_raw()?;

Solution

  • In case others have the same question, here is what I ended up doing. I couldn't find a way to do it using only the high-level portion of the crate, so I combined pieces of the Rust hdf5 crate with the external C function wrappers exposed by hdf5_sys (pulled in as a dependency of the hdf5 crate).

    The code to get a single column of data should look something like this:

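    use std::borrow::Borrow;
    use std::ffi::CString;

    use hdf5::*;
    use hdf5::globals::*;
    use hdf5_sys::{h5d, h5p, h5s, h5t};

    let name_file = String::from("file_name.h5");
    let name_ds = "dataset_name";
    let name_col = "col_name";

    let file = File::open(name_file).expect("Could not open given HDF file.");
    let ds = file.dataset(name_ds).expect("Could not open dataset.");
    // One f64 per record: the buffer must be large enough for every element in the dataset.
    let mut buffer = vec![0.0_f64; ds.size()];
    let name = to_cstring(name_col).unwrap();
    let ds_id = ds.id();

    unsafe {
        // Build an in-memory compound type with a single 8-byte member named after the column.
        // HDF5 matches compound members by name, so only that column is read.
        let dt_id = h5t::H5Tcreate(h5t::H5T_class_t::H5T_COMPOUND, 8);
        h5t::H5Tinsert(dt_id, name.as_ptr(), 0, *H5T_NATIVE_DOUBLE);
        let status = h5d::H5Dread(ds_id, dt_id, h5s::H5S_ALL, h5s::H5S_ALL, h5p::H5P_DEFAULT, buffer.as_mut_ptr().cast());
        // Release the temporary datatype handle before checking the result.
        h5t::H5Tclose(dt_id);
        assert!(status >= 0, "H5Dread failed.");
    }

    let buffer_vec = buffer;

    // Convert a Rust string into a NUL-terminated C string for the HDF5 C API.
    pub fn to_cstring<S: Borrow<str>>(string: S) -> Result<CString> {
        let string = string.borrow();
        #[allow(clippy::map_err_ignore)]
        CString::new(string).map_err(|_| format!("null byte in string: {string:?}").into())
    }
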

    This obviously isn't production-level code, but it should help anyone looking for a path forward.

    Note: if you don't know the data type of the column either, you can get the column descriptors using:

    let data_type = ds.dtype().expect("Could not find datatype.");
    let descriptor = data_type.to_descriptor().expect("Could not ascertain datatype descriptor.");
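
The question also mentions reading each record as raw bytes and transforming it with the descriptor. A minimal sketch of that decoding step is below, using only the standard library. `FieldLayout`, `decode_columns` and `to_rows` are hypothetical names made up for this illustration; the name and byte offset of each field are assumed to have been taken from the compound descriptor above, and the raw buffer is assumed to already hold whole records in native byte order, with every field an f64 as stated in the question.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one member of a compound type: the column
/// name and its byte offset inside a record (assumed to come from the
/// datatype descriptor).
pub struct FieldLayout {
    pub name: String,
    pub offset: usize,
}

/// Decode a buffer of fixed-size records into one Vec<f64> per column.
/// Assumes every field is a native-endian f64.
pub fn decode_columns(
    raw: &[u8],
    record_size: usize,
    fields: &[FieldLayout],
) -> Result<HashMap<String, Vec<f64>>, String> {
    if record_size == 0 || raw.len() % record_size != 0 {
        return Err(format!(
            "{} bytes is not a whole number of {}-byte records",
            raw.len(),
            record_size
        ));
    }
    // Reject layouts whose fields would run past the end of a record.
    for field in fields {
        if field.offset + 8 > record_size {
            return Err(format!("field {:?} does not fit in a record", field.name));
        }
    }

    let n_records = raw.len() / record_size;
    let mut columns: HashMap<String, Vec<f64>> = fields
        .iter()
        .map(|f| (f.name.clone(), Vec::with_capacity(n_records)))
        .collect();

    for record in raw.chunks_exact(record_size) {
        for field in fields {
            let mut bytes = [0u8; 8];
            bytes.copy_from_slice(&record[field.offset..field.offset + 8]);
            // The key was inserted above, so this lookup cannot fail.
            columns
                .get_mut(&field.name)
                .unwrap()
                .push(f64::from_ne_bytes(bytes));
        }
    }
    Ok(columns)
}

/// Turn the column-oriented map into one HashMap per record.
pub fn to_rows(columns: &HashMap<String, Vec<f64>>) -> Vec<HashMap<String, f64>> {
    // Use the shortest column length so indexing never goes out of bounds.
    let n = columns.values().map(Vec::len).min().unwrap_or(0);
    (0..n)
        .map(|i| columns.iter().map(|(k, v)| (k.clone(), v[i])).collect())
        .collect()
}
```

Because the lookup is driven by field names and offsets rather than a fixed struct, the column order can change and columns can be absent without touching the decoding code. `decode_columns` gives the HashMap<String, Vec<f64>> form, and `to_rows` converts it to Vec<HashMap<String, f64>> if the row-oriented form is preferred.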