pythonpython-xarraynetcdf4

Why does saving `to_netcdf` without `encoding=` change some values to `nan`?


I'm struggling to understand a problem in my code when saving a xarray.DataSet as netCDF. The file does not contain any nan values. However after saving and loading it suddenly does for one value:

  1. Before: no nan in the original data before saving:

    
    > ds.where(lambda x: x.isnull(), drop=True).coords
    
    Coordinates:
      * x        (x) float64 
      * y        (y) float64 
      * time     (time) datetime64[ns] 
        lon      (x) float64 
        lat      (y) float64 
    
  2. Saving:

    > ds.to_netcdf("manual_save.nc")
    
  3. Loading: Now a nan appears for a single data entry. Only this entry is affected. The effect is reproducible.

    > xr.open_dataset("manual_save.nc").where(lambda x: x.isnull(), drop=True).coords
    
    Coordinates:
      * x        (x) float64 -3.5
      * y        (y) float64 57.0
      * time     (time) datetime64[ns] 2023-02-01
        lon      (x) float64 -3.5
        lat      (y) float64 57.0
    

I don't understand why this is happening, can someone explain and offer a good solution?

More details

  1. Here's the value before and after saving+loading of the affected entry:

    # Before saving+loading
    > ds["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00").load()
    
    <xarray.DataArray 'soil temperature' ()>
    array(275.88766, dtype=float32)
    Coordinates:
        x        float64 -3.5
        y        float64 57.0
        time     datetime64[ns] 2023-02-01
        lon      float64 -3.5
        lat      float64 57.0
    Attributes:
        units:      K
        long_name:  Soil temperature level 4
        module:     era5
        feature:    temperature
    
    # After saving+loading
    > xr.open_dataset("manual_save.nc")["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00").load()
    
    <xarray.DataArray 'soil temperature' ()>
    array(nan, dtype=float32)
    Coordinates:
        x        float64 -3.5
        y        float64 57.0
        time     datetime64[ns] 2023-02-01
        lon      float64 -3.5
        lat      float64 57.0
    Attributes:
        units:      K
        long_name:  Soil temperature level 4
        module:     era5
        feature:    temperature
    
  2. Before saving the data is represented as a dask.array in xarray, requiring the .load() to show the value. Without .load() it looks like this before saving:

    > ds["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00")
    
    <xarray.DataArray 'soil temperature' ()>
    dask.array<getitem, shape=(), dtype=float32, chunksize=(), chunktype=numpy.ndarray>
    Coordinates:
        x        float64 -3.5
        y        float64 57.0
        time     datetime64[ns] 2023-02-01
        lon      float64 -3.5
        lat      float64 57.0
    Attributes:
        units:      K
        long_name:  Soil temperature level 4
        module:     era5
        feature:    temperature
    
  3. Here's a peak at the full xarray.DataSet. No other entries are affected by the problem:

    > ds
    
    <xarray.Dataset>
    Dimensions:           (x: 23, y: 25, time: 48)
    Coordinates:
      * x                 (x) float64 -4.0 -3.75 -3.5 -3.25 ... 0.75 1.0 1.25 1.5
      * y                 (y) float64 56.0 56.25 56.5 56.75 ... 61.5 61.75 62.0
      * time              (time) datetime64[ns] 2023-01-31 ... 2023-02-01T23:00:00
        lon               (x) float64 -4.0 -3.75 -3.5 -3.25 ... 0.75 1.0 1.25 1.5
        lat               (y) float64 56.0 56.25 56.5 56.75 ... 61.5 61.75 62.0
    Data variables:
        temperature       (time, y, x) float32 dask.array<chunksize=(24, 25, 23), meta=np.ndarray>
        soil temperature  (time, y, x) float32 dask.array<chunksize=(24, 25, 23), meta=np.ndarray>
    Attributes:
        module:             era5
        prepared_features:  ['temperature']
        chunksize_time:     100
        Conventions:        CF-1.6
        history:            2023-03-13 09:15:56 GMT by grib_to_netcdf-2.25.1: /op...
    
  4. I can workaround the issue by specifying an compression with zlib via encoding:

    
    > ds.to_netcdf("manual_save_with_zlib.nc", encoding={'soil temperature': {'zlib': True, 'complevel': 1}})
    > xr.open_dataset("manual_save_with_zlib.nc")["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00").load()
    
    <xarray.DataArray 'soil temperature' ()>
    array(275.88766, dtype=float32)
    Coordinates:
        x        float64 -3.5
        y        float64 57.0
        time     datetime64[ns] 2023-02-01
        lon      float64 -3.5
        lat      float64 57.0
    Attributes:
        units:      K
        long_name:  Soil temperature level 4
        module:     era5
        feature:    temperature
    
    
  5. The DataSet is created quite deep inside the code of a library of ours from the online API of ERA5, so I don't know how to create a MWE to share for this issue. The API access and retrieved data all seem to work fine as always.

  6. (edit) As suggested by psalt I tried .compute() before saving and explicitly specifying compute=True while saving to remove this potential dask stumbling block. Neither change the result, after loading the nan value still exist. Here's what I did:

    > ds.compute().to_netcdf("manual_save_pre-compute.nc")
    > ds.to_netcdf("manual_save-and-compute.nc", compute=True)
    
  7. (edit) I also tried saving to zarr but without any success either. The same problem occurs there after loading.

  8. (out of date)

! (edit) I'm sharing the affected DataSet as pickle because all standard methods from xarray interfere with the problem. If you unpickle the version and then save the DataSet as described above, you can reproduce the problem. You can download the pickle file here. ! >! python >! > import pickle >! >! # Code used for creating the pickle >! > f = open("manual_save.pickle", "wb") >! > pickle.dump(ds, f, protocol=pickle.HIGHEST_PROTOCOL) >! > f.close() >! >! # Code for unpickling >! with open("manual_save.pickle", "rb") as f: >! ds = pickle.load(f)~ >!

  1. (edit) I've managed to track down the error to an instable netCDF file. You can download the file here. Tested with xarray=2023.2.0 the following code seems to create a nan value out of thin air:

    import xarray as xr
    
    ds = xr.open_mfdataset("instable-datafile.nc")
    display("This contains no nan values", ds["t2m"].values)
    
    ds.to_netcdf("collapsed-datafile.nc")
    display("This contains nan values", xr.open_dataset("collapsed-datafile.nc")["t2m"].values)
    
    # Output
    
    'This contains no nan values'
    
    array([[[278.03146, 278.4846 ],
            [278.50998, 278.6799 ]],
    
            [[277.91476, 278.4109 ],
            [278.36594, 278.571  ]]], dtype=float32)
    
    'This contains nan values'
    
    array([[[278.03146, 278.4846 ],
            [278.50998, 278.6799 ]],
    
            [[      nan, 278.4109 ],
            [278.36594, 278.571  ]]], dtype=float32)
    

I'm happy to provide more information. Just let me know.


Solution

  • I played around with your file and found the reason for the odd behaviour. This seems to be a bug in xarray, incorrectly handling scaling factor and offset of netCDFs. Deeper down it is related with Python floating point precision. You can find a workaround for your problem at the bottom of the post.

    scale_factor and offset

    Variables in netCDFs can be encoded with the attributes scale_factor and add_offset. This allows to store the data for example as type short, ranging from -32768 to +32767, saving a lot of space over using float. Consider the relevant output of

    $ ncdump instable-datafile.nc

    variables:
            short t2m(time, latitude, longitude) ;
                    t2m:scale_factor = 1.16753614203674e-05 ;
                    t2m:add_offset = 278.297319296597 ;
                    t2m:_FillValue = -32767s ;
                    t2m:missing_value = -32767s ;
                    t2m:units = "K" ;
                    t2m:long_name = "2 metre temperature" ;
    
     t2m =
      -22772, 16038,
      18213, 32767,
      -32766, 9725,
      5877, 23442 ;
    
    
    

    How xarray deals with scale_factor and offset

    The values for the variable t2m are stored as short, when you open the netCDF with xarray, scale_factor and add_offset are applied following the formula:

    value*scale_factor+add_offset

    xarray shows the following content of the t2m:

    import xarray as xr
    >>> ds = xr.open_dataset("instable-datafile.nc")
    >>> ds["t2m"].values
    array([[[278.03146, 278.4846 ],
            [278.50998, 278.6799 ]],
    
           [[277.91476, 278.4109 ],
            [278.36594, 278.571  ]]], dtype=float32)
    
    

    When you save this dataset with xarray using to_netcdf(), it "packs" the values again using the scale_factor and offset, but this time the other way round.

    Let's revert the equation above, apply it on the values and see what happens:

    py
    >>> scale_factor = 1.16753614203674e-05
    >>> add_offset = 278.297319296597
    >>> (ds["t2m"].values-add_offset)/scale_factor
    
    array([[[-22771.812,  16038.549],
            [ 18213.268,  32767.152]],
    
           [[-32767.152,   9726.115],
            [  5875.922,  23440.955]]], dtype=float32)
    
    

    The problem

    As you can see, the numbers changed slightly, compared to the output form ncdump above. This has to do with floating point precision.

    You can see that the dtype of the array is still np.float32. When we save the dataset, t2m gets of course stored again as type short. This is not too problematic, because unpacking these values again will lead to differences in t2m at decimal places we don't care about. In fact we do not even notice, xarray shows you just the first 4 or 5 decimals.

    Now, why is this specific value NaN? This has to do with with the _FillValue and the MissingValue attributes. For t2m, the _FillValue is set to -32767. If you look at the calculation above, this is exactly the value that is packed by xarray's .to_netcdf()

    A workaround

    If we convert the data to float64, we can increase the precision of the data and prevent the unwanted behaviour, however this has the downside that we lose the encoding of "t2m", including scale_factor and add_offset. This is undesirably, as we cannot preserve the compression.

    >>> ds["t2m"] = ds["t2m"].astype(np.float64)
    >>> ds["t2m"]
    array([[[278.0314636230469, 278.4845886230469],
            [278.5099792480469, 278.6799011230469]],
    
           [[277.9147644042969, 278.410888671875 ],
            [278.3659362792969, 278.5710144042969]]])
    
    >>> ds.to_netcdf("stable-datafile.nc")
    

    ncdump output:

     t2m =
      278.031463623047, 278.484588623047,
      278.509979248047, 278.679901123047,
      277.914764404297, 278.410888671875,
      278.365936279297, 278.571014404297 ;
    

    There is an ongoing issue in the xarray repository, so this problem might have a better solution in the future.