I have data in which each row can hold an array of a different length, for example:
import numpy as np
import pandas as pd

data = {
    'a': [np.array([1., 2.]), np.array([6., 7., .6]), np.array([np.nan])],
    'b': np.array([99., 66., 88.])
}
df = pd.DataFrame(data)
I want to save this to an HDF5 file for archiving purposes and be able to reuse it in MATLAB.
Saving it with
df.to_hdf('df.h5', mode='w', key='data', format='fixed')
works, but the result is not reusable elsewhere, because the file is written in a pandas-specific format.
Saving it with
df.to_hdf('df.h5', mode='w', key='data', format='table')
is not possible and results in
TypeError: Cannot serialize the column [a]
because its data contents are not [string] but [mixed] object dtype
Trying something like:
with h5py.File('df.h5', 'w') as h5f:
    h5f.create_dataset('data', data=df.to_numpy().tolist())
doesn't work and results in:
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.
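As far as I can tell, this error comes from NumPy rather than from HDF5 itself: h5py first converts `data` to a NumPy array, and rows of lengths 2, 3 and 1 cannot form a rectangular float array. A minimal sketch of the mismatch, using only NumPy (the names `ragged`, `obj`, `lengths` and `is_rectangular` are mine, for illustration only):

```python
import numpy as np

# Same ragged rows as column 'a' above.
ragged = [np.array([1., 2.]), np.array([6., 7., .6]), np.array([np.nan])]

# An object array stores one reference per row, so row lengths may differ.
obj = np.empty(len(ragged), dtype=object)
for i, row in enumerate(ragged):
    obj[i] = row

# A plain float dataset would need every row to have the same length.
lengths = [len(r) for r in obj]
is_rectangular = len(set(lengths)) == 1

print(lengths, is_rectangular)
```

Such an object array still cannot be written as-is, since the generic object dtype has no native HDF5 equivalent.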
I also tried pytables and hdf5storage without much success.
Is there a straightforward way to save my df to an HDF5 file in a reusable way, or should I move to a different file format? If so, which file format would you recommend for my purpose?
I solved my problem with h5py's vlen_dtype (variable-length data type):
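import h5py
import numpy as np
import pandas as pd

# make the data
data = {
    'a': [np.array([1., 2.]), np.array([6., 7., .6]), np.array([np.nan])],
    'b': np.array([99., 66., 88.])
}
df = pd.DataFrame(data)

# variable-length float64 datatype
dt = h5py.vlen_dtype(np.dtype('float64'))

# the with-block closes the file, so everything is flushed to disk
with h5py.File('my.h5', mode='w') as h5file:
    # store 'a' as a variable-length dataset
    h5file.create_dataset('a', data=df['a'], dtype=dt)
    # store 'b' as a regular float dataset
    h5file['b'] = df['b']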
This seems to work as intended.
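To check the result, the file can be read back with h5py. The following is only a sketch under a few assumptions of mine: it writes to a temporary path instead of `my.h5`, fills the variable-length dataset row by row rather than passing the pandas column, and the names `rows`, `a_back`, `b_back` and `base_type` are illustrative.

```python
import os
import tempfile

import h5py
import numpy as np

# Same values as column 'a' above; the temporary path is an assumption.
rows = [np.array([1., 2.]), np.array([6., 7., .6]), np.array([np.nan])]
path = os.path.join(tempfile.mkdtemp(), "my.h5")

dt = h5py.vlen_dtype(np.dtype("float64"))

with h5py.File(path, "w") as f:
    # One variable-length element per row.
    ds = f.create_dataset("a", shape=(len(rows),), dtype=dt)
    for i, row in enumerate(rows):
        ds[i] = row
    f["b"] = np.array([99., 66., 88.])

with h5py.File(path, "r") as f:
    # Reading a vlen dataset yields an object array of 1-D float arrays.
    a_back = [np.asarray(x) for x in f["a"][:]]
    b_back = f["b"][:]
    # check_vlen_dtype returns the base type for vlen data, else None.
    base_type = h5py.check_vlen_dtype(f["a"].dtype)

print([len(x) for x in a_back], b_back, base_type)
```

Each row comes back with its original length, so no padding is needed. I have not verified how MATLAB presents the variable-length dataset on import, so that part remains an assumption.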