Tags: python-3.x, pandas, python-xarray, netcdf4

ValueError when saving an xarray Dataset to netCDF to store metadata alongside a pandas DataFrame


I want to attach some metadata to a pandas DataFrame. The metadata is a description of the data processing that was done before the DataFrame was saved.

I came across this solution: https://stackoverflow.com/a/52546933/214526

So I tried the following:

xarrDS: xarray.Dataset = pdDF.to_xarray()
xarrDS.attrs["description"] = "Some description about data processing"
# If I display xarrDS in the notebook at this point, it shows the data correctly

xarrDS.to_netcdf(path="processed_df.nc")

But saving to netCDF raises this exception:

ValueError: setting an array element with a sequence

The pandas DataFrame does not contain any NaN values. I have not found any relevant solutions online. I see that this article also saves the dataset using similar code.

Any pointers on how to resolve this, or an alternative way to save the metadata (without using additional MLOps libraries), would be appreciated.

The library versions I am using are:

pandas=1.5.3
xarray=2022.11.0
netcdf4=1.6.3

Solution

  • The likely cause of that error is that your pandas DataFrame has some columns of dtype object, for example columns holding strings. The automatic conversion may then fail to map that dtype to one of the data types supported by netCDF4.

    I tested it myself, and strings work without any issue. What causes problems are columns that hold lists or arrays in their cells. Here you are out of luck, because xarray has no supported way to write such object cells to a netCDF file.

    import numpy as np  # only needed if the "arrays" column is uncommented
    import pandas as pd

    data = {
      "calories": [420, 380, 390],
      "duration": [50.4, 40.2, 45.7],
      "type": ["a", "foo", "bar10"],
      # "arrays": [np.arange(4), np.arange(3), np.arange(2)],
      "lists": [[1, 2], [3, 4], [5, 6]]
    }

    # Load the data into a DataFrame and inspect the column dtypes
    df = pd.DataFrame(data)
    df.dtypes
    

    You can try this without the lists and arrays columns, and it will work. With either of them included, you get the error you are seeing:

    ds = df.to_xarray()
    ds.to_netcdf("test.nc")
    

    Whether or not you set an attribute makes no difference to this error.
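
If you need to keep such columns, one possible workaround (my own suggestion, not something xarray does for you) is to find the columns whose cells hold lists or arrays and turn those cells into JSON strings before converting, since plain strings were saved without problems in the test above. The helper names `find_sequence_columns` and `encode_sequence_columns` are hypothetical, and the sketch assumes the cell contents are JSON-serializable (numbers, strings, nested lists):

```python
import json

import numpy as np
import pandas as pd


def find_sequence_columns(df: pd.DataFrame) -> list:
    """Return names of object columns with at least one list, tuple, or ndarray cell."""
    sequence_types = (list, tuple, np.ndarray)
    return [
        name
        for name in df.columns
        if pd.api.types.is_object_dtype(df[name])
        and df[name].map(lambda cell: isinstance(cell, sequence_types)).any()
    ]


def encode_sequence_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df in which the offending columns hold JSON strings."""
    encoded = df.copy()
    for name in find_sequence_columns(df):
        # np.asarray(...).tolist() turns lists and arrays alike into plain Python lists
        encoded[name] = encoded[name].map(
            lambda cell: json.dumps(np.asarray(cell).tolist())
        )
    return encoded


df = pd.DataFrame(
    {
        "calories": [420, 380, 390],
        "type": ["a", "foo", "bar10"],
        "lists": [[1, 2], [3, 4], [5, 6]],
    }
)

bad_columns = find_sequence_columns(df)  # columns that would break to_netcdf
encoded = encode_sequence_columns(df)    # same frame, sequence cells as JSON text
```

After this step, `encoded.to_xarray()` followed by `to_netcdf(...)` should take the same path as the string column in the example above. When reading the file back, apply `json.loads` to each cell of the encoded columns to recover the lists.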