I am trying to import an HDF5 file in Python. I do not have details on how the file was written. Therefore, I tried vaex and pandas to open it. How can I specify my columns so that they are recognized?
I tried to check the structure of the file with:
$ h5ls -v file.hdf5/DataSet
Opened "file.hdf5" with sec2 driver.
DataSet Dataset {5026/Inf}
Attribute: Species scalar
Type: 12-byte null-terminated ASCII string
Attribute: Tuning scalar
Type: 8-byte null-terminated ASCII string
Location: 1:800
Links: 1
Chunks: {1} 88 bytes
Storage: 442288 logical bytes, 442288 allocated bytes, 100.00% utilization
Type: struct {
"Scan" +0 native double
"col6" +8 native double
"col5" +16 native double
"col10" +24 native double
"col7" +32 native double
"col8" +40 native double
"col1" +48 native double
"col2" +56 native double
"col4" +64 native double
"col9" +72 native double
"col3" +80 native double
} 88 bytes
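For reference, the struct reported by h5ls corresponds to a NumPy structured dtype. The following is a minimal sketch, assuming NumPy is installed; the names field_names, record_dtype and offsets are only illustrative and do not come from the file:

```python
import numpy as np

# Field names in the on-disk order reported by h5ls; each one is a native double.
field_names = ["Scan", "col6", "col5", "col10", "col7", "col8",
               "col1", "col2", "col4", "col9", "col3"]
record_dtype = np.dtype([(name, np.float64) for name in field_names])

# Byte offset of every field inside one record.
offsets = {name: record_dtype.fields[name][1] for name in field_names}

print(record_dtype.itemsize)  # size of one record in bytes
print(offsets)
```

Eleven 8-byte doubles give the 88-byte record size shown by h5ls, so every row of DataSet is one such record rather than eleven separate columns.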
When I use vaex, the individual columns are not recognized and all the data ends up in a single column, DataSet.
import vaex as vx
df = vx.open('file.hdf5')
df
df['DataSet']
The output looks like this:
# DataSet
0 '(0., 1.36110629e-11, 5.45816316e-09, 3.79845801...
1 '(1., 1.3613447e-11, 5.45889204e-09, 3.79879826e...
...
Expression = DataSet
Length: 5,026 dtype: [('Scan', '<f8'), ('col6', '<f8'), ('col5', '<f8'), ('col10', '<f8'), ('col7', '<f8'), ('col8', '<f8'), ('col1', '<f8'), ('col2', '<f8'), ('col4', '<f8'), ('col9', '<f8'), ('col3', '<f8')] (column)
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
0 '(0., 1.36110629e-11, 5.45816316e-09, 3.79845801...
1 '(1., 1.3613447e-11, 5.45889204e-09, 3.79879826e...
...
Is there an option/way to tell vx.open how my columns are organized?
I tried to import the file using pandas as suggested here, but pd.read_hdf('file.hdf5') results in a ValueError.
I used the h5py package to read the HDF5 file and the vaex.from_arrays method to create a dataframe.
import vaex
import h5py

# Read each field of the compound dataset into its own vaex column.
with h5py.File('file.hdf5', 'r') as data_file:
    dset = data_file['DataSet']
    df = vaex.from_arrays(
        Scan=dset['Scan'], col1=dset['col1'], col2=dset['col2'], col3=dset['col3'],
        col4=dset['col4'], col5=dset['col5'], col6=dset['col6'], col7=dset['col7'],
        col8=dset['col8'], col9=dset['col9'], col10=dset['col10'],
    )
df
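Listing every field by hand gets tedious for wider records. A minimal sketch of a more generic variant follows, assuming the whole dataset fits in memory; split_fields is a hypothetical helper, and the small in-memory array stands in for the real file so that the snippet runs without it:

```python
import numpy as np

def split_fields(records):
    """Return {field name: 1-D array} for a NumPy structured array."""
    if records.dtype.names is None:
        raise TypeError("expected a structured (compound) array")
    # Each records[name] is a strided view; copy it into its own contiguous buffer.
    return {name: np.ascontiguousarray(records[name])
            for name in records.dtype.names}

# Stand-in data with three of the fields. With the real file this would be:
#     with h5py.File('file.hdf5', 'r') as data_file:
#         records = data_file['DataSet'][...]
records = np.zeros(4, dtype=[("Scan", "f8"), ("col1", "f8"), ("col2", "f8")])
records["Scan"] = np.arange(4)

columns = split_fields(records)
# The dictionary can then be passed on: df = vaex.from_arrays(**columns)
```

The field names are taken from the dtype, so the column order follows the on-disk order (Scan, col6, col5, ...) rather than the numeric order used above.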