python pandas hdf

Pandas to_hdf() TypeError: object of type 'int' has no len()


I would like to store a pandas DataFrame such that, when I later load it again, I can read back only certain columns rather than the entire thing. Therefore, I am trying to store the DataFrame in HDF format. One of the DataFrame's columns contains numpy arrays, and writing it produces the following error message.

Any idea on how to get rid of the error or what format I could use instead?

CODE:

import pandas as pd
import numpy as np

df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4]})
df["c"] = [np.ones((4,4)) for i in range(4)]
df.to_hdf("test.h5", "df", format='table', data_columns=True)

ERROR:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-ace42e5ccbb7> in <module>
----> 1 df.to_hdf("test.h5", "df", format='table', data_columns=True)

/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in to_hdf(self, path_or_buf, key, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
   2619             data_columns=data_columns,
   2620             errors=errors,
-> 2621             encoding=encoding,
   2622         )
   2623 

/opt/conda/lib/python3.7/site-packages/pandas/io/pytables.py in to_hdf(path_or_buf, key, value, mode, complevel, complib, append, format, index, min_itemsize, nan_rep, dropna, data_columns, errors, encoding)
    278             path_or_buf, mode=mode, complevel=complevel, complib=complib
    279         ) as store:
--> 280             f(store)
    281     else:
    282         f(path_or_buf)

/opt/conda/lib/python3.7/site-packages/pandas/io/pytables.py in <lambda>(store)
    270             errors=errors,
    271             encoding=encoding,
--> 272             dropna=dropna,
    273         )
    274 

/opt/conda/lib/python3.7/site-packages/pandas/io/pytables.py in put(self, key, value, format, index, append, complib, complevel, min_itemsize, nan_rep, data_columns, encoding, errors, track_times, dropna)
   1104             errors=errors,
   1105             track_times=track_times,
-> 1106             dropna=dropna,
   1107         )
   1108 

/opt/conda/lib/python3.7/site-packages/pandas/io/pytables.py in _write_to_group(self, key, value, format, axes, index, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, encoding, errors, track_times)
   1753             nan_rep=nan_rep,
   1754             data_columns=data_columns,
-> 1755             track_times=track_times,
   1756         )
   1757 

/opt/conda/lib/python3.7/site-packages/pandas/io/pytables.py in write(self, obj, axes, append, complib, complevel, fletcher32, min_itemsize, chunksize, expectedrows, dropna, nan_rep, data_columns, track_times)
   4222             min_itemsize=min_itemsize,
   4223             nan_rep=nan_rep,
-> 4224             data_columns=data_columns,
   4225         )
   4226 

/opt/conda/lib/python3.7/site-packages/pandas/io/pytables.py in _create_axes(self, axes, obj, validate, nan_rep, data_columns, min_itemsize)
   3892                 nan_rep=nan_rep,
   3893                 encoding=self.encoding,
-> 3894                 errors=self.errors,
   3895             )
   3896             adj_name = _maybe_adjust_name(new_name, self.version)

/opt/conda/lib/python3.7/site-packages/pandas/io/pytables.py in _maybe_convert_for_string_atom(name, block, existing_col, min_itemsize, nan_rep, encoding, errors)
   4885         # we cannot serialize this data, so report an exception on a column
   4886         # by column basis
-> 4887         for i in range(len(block.shape[0])):
   4888             col = block.iget(i)
   4889             inferred_type = lib.infer_dtype(col, skipna=False)

TypeError: object of type 'int' has no len()

Solution

  • Pandas seems to have trouble serializing the column of numpy arrays in your DataFrame (the HDF5 table format only handles scalar and string columns), so I would suggest storing the numpy data in a separate *.h5 file.

    import pandas as pd
    import numpy as np
    import h5py
    
    # store the scalar columns with pandas
    df = pd.DataFrame({"a": [1,2,3,4], "b": [1,2,3,4]})
    df.to_hdf("pandas_data.h5", "df", format='table', data_columns=True)
    
    # store the array data separately with h5py (stacked into one (4, 4, 4) dataset)
    c = [np.ones((4,4)) for i in range(4)]
    with h5py.File('numpy_data.h5', 'w') as hf:
        hf.create_dataset('dataset_1', data=c)
    

    You can then load that data back in using:

    with h5py.File('numpy_data.h5', 'r') as hf:
        c_out = hf['dataset_1'][:]
    
    df = pd.read_hdf('pandas_data.h5', 'df')
    # re-attach the arrays as a column of (4, 4) matrices
    df['c'] = list(c_out)
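
    Since the pandas part was written with format='table' and data_columns=True, you can also read back just a subset of the scalar columns, which was your original goal. A minimal sketch, reusing the file and key names from above:

    import pandas as pd
    
    # read only column "a" from the table-format store
    a_only = pd.read_hdf('pandas_data.h5', 'df', columns=['a'])
    print(a_only)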