python pandas hdf5

Write pandas df to reusable hdf5 or use other data format?


I have data in which one column holds arrays of a different size in each row, like:

import numpy as np
import pandas as pd

data = {
    'a': [np.array([1.,2.]), np.array([6.,7.,.6]), np.array([np.nan])],
    'b': np.array([99., 66., 88.])
}
df = pd.DataFrame(data)

I want to save this to an HDF5 file for archiving purposes and be able to reuse it in Matlab.

Saving it with

df.to_hdf('df.h5', mode='w', key='data', format='fixed')

is possible, but the result is not reusable, as it is saved in a pandas-specific format.

Saving it with

df.to_hdf('df.h5', mode='w', key='data', format='table')

is not possible and results in

TypeError: Cannot serialize the column [a]
because its data contents are not [string] but [mixed] object dtype

Also, trying something like:

with h5py.File('df.h5', 'w') as h5f:
    h5f.create_dataset('data', data=df.to_numpy().tolist())

doesn't work and results in:

ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.

I also tried pytables and hdf5storage without much success. Is there a straightforward way to save my df to an HDF5 file in a reusable way, or should I move to a different file format? If so, which file format would be recommended for my purpose?


Solution

  • I solved my problem with h5py's vlen_dtype:

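    import h5py
    import numpy as np
    import pandas as pd

    # make the data
    data = {
        'a': [np.array([1.,2.]), np.array([6.,7.,.6]), np.array([np.nan])],
        'b': np.array([99., 66., 88.])
    }
    df = pd.DataFrame(data)

    # the context manager closes the file, so everything is flushed to disk
    with h5py.File('my.h5', mode='w') as h5file:
        # variable-length float datatype
        dt = h5py.vlen_dtype(np.dtype('float64'))
        # store 'a' (one variable-length float array per row)
        h5file.create_dataset('a', data=df['a'], dtype=dt)
        # store 'b' (plain float64 column)
        h5file['b'] = df['b']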
    

    This seems to work as intended.
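
To check that the file really is reusable, a round trip along these lines may help. It is only a sketch under a few assumptions: the temporary path and the names `a`, `b`, `a_back`, and `b_back` are illustrative, pandas is left out so that only the h5py part is exercised, and the rows are written one at a time (the pattern shown in the h5py documentation for variable-length datasets) instead of passing the whole column at once.

```python
import os
import tempfile

import h5py
import numpy as np

# Illustrative ragged data mirroring columns 'a' and 'b' from the question.
a = [np.array([1.0, 2.0]), np.array([6.0, 7.0, 0.6]), np.array([np.nan])]
b = np.array([99.0, 66.0, 88.0])

# Hypothetical location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "my.h5")

# Write: one variable-length float64 array per row for 'a', a plain array for 'b'.
with h5py.File(path, "w") as h5file:
    dt = h5py.vlen_dtype(np.dtype("float64"))
    ds = h5file.create_dataset("a", shape=(len(a),), dtype=dt)
    for i, row in enumerate(a):
        ds[i] = row
    h5file["b"] = b

# Read back: 'a' comes out as an object array whose elements are float64 arrays.
with h5py.File(path, "r") as h5file:
    a_back = list(h5file["a"][:])
    b_back = h5file["b"][:]

# Compare row by row; equal_nan=True treats the NaN row as matching.
for original, restored in zip(a, a_back):
    assert np.array_equal(original, restored, equal_nan=True)
```

If the row-by-row comparison passes, the ragged column survived the round trip without any pandas-specific metadata. How other tools (Matlab included) present variable-length datasets is worth checking on the reading side as well.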