python pandas hdf5 vaex

Columns not recognized when importing HDF5 file


I am trying to import an HDF5 file in Python. I do not know how the file was written, so I tried both vaex and pandas to open it. How can I specify the columns so that they are recognized?

I tried to check the structure of the file with:

$ h5ls -v file.hdf5/DataSet
Opened "file.hdf5" with sec2 driver.
DataSet                  Dataset {5026/Inf}
    Attribute: Species scalar
        Type:      12-byte null-terminated ASCII string
    Attribute: Tuning scalar
        Type:      8-byte null-terminated ASCII string
    Location:  1:800
    Links:     1
    Chunks:    {1} 88 bytes
    Storage:   442288 logical bytes, 442288 allocated bytes, 100.00% utilization
    Type:      struct {
                   "Scan"             +0    native double
                   "col6"            +8    native double
                   "col5"            +16   native double
                   "col10"           +24   native double
                   "col7"            +32   native double
                   "col8"            +40   native double
                   "col1"            +48   native double
                   "col2"            +56   native double
                   "col4"            +64   native double
                   "col9"            +72   native double
                   "col3"            +80   native double
               } 88 bytes

vaex

When I use vaex, the individual columns are not recognized and all the data ends up in a single column, DataSet.

import vaex as vx
df = vx.open('file.hdf5')
df
df['DataSet']

The output looks like this:

#      DataSet
0      '(0., 1.36110629e-11, 5.45816316e-09, 3.79845801...
1      '(1., 1.3613447e-11, 5.45889204e-09, 3.79879826e...
...
Expression = DataSet
Length: 5,026 dtype: [('Scan', '<f8'), ('col6', '<f8'), ('col5', '<f8'), ('col10', '<f8'), ('col7', '<f8'), ('col8', '<f8'), ('col1', '<f8'), ('col2', '<f8'), ('col4', '<f8'), ('col9', '<f8'), ('col3', '<f8')] (column)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   0  '(0., 1.36110629e-11, 5.45816316e-09, 3.79845801...
   1  '(1., 1.3613447e-11, 5.45889204e-09, 3.79879826e...
...

Is there an option/way to tell vx.open how my columns are organized?

pandas

I tried to import the file using pandas as suggested here, but

pd.read_hdf('file.hdf5')

results in a ValueError.


Solution

  • I used the h5py package to read the HDF5 file, selected each field of the compound dataset by name, and passed the resulting arrays to vaex.from_arrays to create a dataframe.

    import h5py
    import vaex

    # Open the file read-only; indexing the compound dataset by field name
    # returns that field as a NumPy array.
    with h5py.File('file.hdf5', 'r') as data_file:
        dset = data_file['DataSet']
        df = vaex.from_arrays(
            Scan=dset['Scan'], col1=dset['col1'], col2=dset['col2'],
            col3=dset['col3'], col4=dset['col4'], col5=dset['col5'],
            col6=dset['col6'], col7=dset['col7'], col8=dset['col8'],
            col9=dset['col9'], col10=dset['col10'],
        )
    df
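
If the field names are not known in advance, they can be taken from the dataset's dtype instead of being typed out one by one. The following is a minimal sketch of that idea, not a tested recipe for this particular file: the helper names split_fields and load_compound are hypothetical, and it assumes the whole dataset fits in memory. Columns come out in the order stored in the file (Scan, col6, col5, ...), not in numeric order.

```python
import numpy as np


def split_fields(records):
    """Return {field name: contiguous 1-D array} for a NumPy structured array."""
    return {name: np.ascontiguousarray(records[name]) for name in records.dtype.names}


def load_compound(path, dataset_name):
    """Read a compound HDF5 dataset and return its fields as a dict of columns."""
    import h5py  # imported lazily so split_fields can be used without h5py

    with h5py.File(path, "r") as data_file:
        records = data_file[dataset_name][()]  # read the whole dataset into memory
    return split_fields(records)


# Usage (file and dataset names taken from the question):
#   columns = load_compound("file.hdf5", "DataSet")
#   df = vaex.from_arrays(**columns)    # vaex dataframe
#   df = pandas.DataFrame(columns)      # or a pandas dataframe
```

Copying each field into its own contiguous array avoids handing strided views of the record buffer to the dataframe library, at the cost of some extra memory.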