I'm struggling to understand a problem in my code when saving an xarray.Dataset as netCDF. The dataset does not contain any nan values before saving. However, after saving and loading, it suddenly does for exactly one value:
Before: no nan values in the original data:
> ds.where(lambda x: x.isnull(), drop=True).coords
Coordinates:
* x (x) float64
* y (y) float64
* time (time) datetime64[ns]
lon (x) float64
lat (y) float64
Saving:
> ds.to_netcdf("manual_save.nc")
Loading: now a nan appears for a single data entry. Only this entry is affected. The effect is reproducible.
> xr.open_dataset("manual_save.nc").where(lambda x: x.isnull(), drop=True).coords
Coordinates:
* x (x) float64 -3.5
* y (y) float64 57.0
* time (time) datetime64[ns] 2023-02-01
lon (x) float64 -3.5
lat (y) float64 57.0
I don't understand why this is happening. Can someone explain it and offer a good solution?
More details
Here's the value before and after saving+loading of the affected entry:
# Before saving+loading
> ds["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00").load()
<xarray.DataArray 'soil temperature' ()>
array(275.88766, dtype=float32)
Coordinates:
x float64 -3.5
y float64 57.0
time datetime64[ns] 2023-02-01
lon float64 -3.5
lat float64 57.0
Attributes:
units: K
long_name: Soil temperature level 4
module: era5
feature: temperature
# After saving+loading
> xr.open_dataset("manual_save.nc")["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00").load()
<xarray.DataArray 'soil temperature' ()>
array(nan, dtype=float32)
Coordinates:
x float64 -3.5
y float64 57.0
time datetime64[ns] 2023-02-01
lon float64 -3.5
lat float64 57.0
Attributes:
units: K
long_name: Soil temperature level 4
module: era5
feature: temperature
Before saving, the data is represented as a dask.array in xarray, so .load() is required to show the value. Without .load(), it looks like this before saving:
> ds["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00")
<xarray.DataArray 'soil temperature' ()>
dask.array<getitem, shape=(), dtype=float32, chunksize=(), chunktype=numpy.ndarray>
Coordinates:
x float64 -3.5
y float64 57.0
time datetime64[ns] 2023-02-01
lon float64 -3.5
lat float64 57.0
Attributes:
units: K
long_name: Soil temperature level 4
module: era5
feature: temperature
Here's a peek at the full xarray.Dataset. No other entries are affected by the problem:
> ds
<xarray.Dataset>
Dimensions: (x: 23, y: 25, time: 48)
Coordinates:
* x (x) float64 -4.0 -3.75 -3.5 -3.25 ... 0.75 1.0 1.25 1.5
* y (y) float64 56.0 56.25 56.5 56.75 ... 61.5 61.75 62.0
* time (time) datetime64[ns] 2023-01-31 ... 2023-02-01T23:00:00
lon (x) float64 -4.0 -3.75 -3.5 -3.25 ... 0.75 1.0 1.25 1.5
lat (y) float64 56.0 56.25 56.5 56.75 ... 61.5 61.75 62.0
Data variables:
temperature (time, y, x) float32 dask.array<chunksize=(24, 25, 23), meta=np.ndarray>
soil temperature (time, y, x) float32 dask.array<chunksize=(24, 25, 23), meta=np.ndarray>
Attributes:
module: era5
prepared_features: ['temperature']
chunksize_time: 100
Conventions: CF-1.6
history: 2023-03-13 09:15:56 GMT by grib_to_netcdf-2.25.1: /op...
I can work around the issue by specifying zlib compression via encoding:
> ds.to_netcdf("manual_save_with_zlib.nc", encoding={'soil temperature': {'zlib': True, 'complevel': 1}})
> xr.open_dataset("manual_save_with_zlib.nc")["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00").load()
<xarray.DataArray 'soil temperature' ()>
array(275.88766, dtype=float32)
Coordinates:
x float64 -3.5
y float64 57.0
time datetime64[ns] 2023-02-01
lon float64 -3.5
lat float64 57.0
Attributes:
units: K
long_name: Soil temperature level 4
module: era5
feature: temperature
The Dataset is created quite deep inside the code of one of our libraries from the online ERA5 API, so I don't know how to create an MWE to share for this issue. The API access and the retrieved data all seem to work fine, as always.
(edit) As suggested by psalt, I tried .compute() before saving and explicitly specified compute=True while saving to remove dask as a potential stumbling block. Neither changes the result: after loading, the nan value still exists. Here's what I did:
> ds.compute().to_netcdf("manual_save_pre-compute.nc")
> ds.to_netcdf("manual_save-and-compute.nc", compute=True)
(edit) I also tried saving to zarr, but without success either. The same problem occurs there after loading.
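For reference, here is a sketch of roughly what that zarr round trip looked like (the paths are illustrative, not the exact ones from my code):
# Sketch of the zarr attempt; paths are illustrative
ds.to_zarr("manual_save.zarr", mode="w")
ds_zarr = xr.open_zarr("manual_save.zarr")
# The same entry comes back as nan here as well
ds_zarr["soil temperature"].sel(x=-3.5, y=57, time="2023-02-01 00:00").load()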
(out of date) (edit) I'm sharing the affected Dataset as a pickle, because all standard methods from xarray interfere with the problem. If you unpickle the file and then save the Dataset as described above, you can reproduce the problem. You can download the pickle file here.
import pickle
# Code used for creating the pickle
f = open("manual_save.pickle", "wb")
pickle.dump(ds, f, protocol=pickle.HIGHEST_PROTOCOL)
f.close()
# Code for unpickling
with open("manual_save.pickle", "rb") as f:
    ds = pickle.load(f)
(edit) I've managed to track the error down to an unstable netCDF file. You can download the file here. Tested with xarray=2023.2.0, the following code seems to create a nan value out of thin air:
import xarray as xr
ds = xr.open_mfdataset("instable-datafile.nc")
display("This contains no nan values", ds["t2m"].values)
ds.to_netcdf("collapsed-datafile.nc")
display("This contains nan values", xr.open_dataset("collapsed-datafile.nc")["t2m"].values)
# Output
'This contains no nan values'
array([[[278.03146, 278.4846 ],
[278.50998, 278.6799 ]],
[[277.91476, 278.4109 ],
[278.36594, 278.571 ]]], dtype=float32)
'This contains nan values'
array([[[278.03146, 278.4846 ],
[278.50998, 278.6799 ]],
[[ nan, 278.4109 ],
[278.36594, 278.571 ]]], dtype=float32)
I'm happy to provide more information. Just let me know.
I played around with your file and found the reason for the odd behaviour. This seems to be a bug in xarray that mishandles the scale factor and offset of netCDFs. Deeper down, it is related to floating point precision. You can find a workaround for your problem at the bottom of this post.
scale_factor and offset
Variables in netCDFs can be encoded with the attributes scale_factor and add_offset. This allows the data to be stored, for example, as type short (ranging from -32768 to +32767), saving a lot of space compared to float.
Consider the relevant output of
$ ncdump instable-datafile.nc
variables:
short t2m(time, latitude, longitude) ;
t2m:scale_factor = 1.16753614203674e-05 ;
t2m:add_offset = 278.297319296597 ;
t2m:_FillValue = -32767s ;
t2m:missing_value = -32767s ;
t2m:units = "K" ;
t2m:long_name = "2 metre temperature" ;
t2m =
-22772, 16038,
18213, 32767,
-32766, 9725,
5877, 23442 ;
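The same packing parameters are also visible from Python: after decoding, xarray keeps them in the variable's .encoding dict. A small sketch (the exact contents shown in the comment are an assumption based on the ncdump output above):
import xarray as xr

ds = xr.open_dataset("instable-datafile.nc")
# xarray stores the on-disk packing attributes in .encoding after decoding
print(ds["t2m"].encoding)
# expected to contain entries like:
# {'dtype': dtype('int16'), 'scale_factor': 1.16753614203674e-05,
#  'add_offset': 278.297319296597, '_FillValue': -32767, ...}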
How xarray deals with scale_factor and offset
The values for the variable t2m are stored as short. When you open the netCDF with xarray, scale_factor and add_offset are applied following the formula:
unpacked value = packed value * scale_factor + add_offset
xarray shows the following content for t2m:
import xarray as xr
>>> ds = xr.open_dataset("instable-datafile.nc")
>>> ds["t2m"].values
array([[[278.03146, 278.4846 ],
[278.50998, 278.6799 ]],
[[277.91476, 278.4109 ],
[278.36594, 278.571 ]]], dtype=float32)
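To see the formula in action, we can apply it by hand to the raw short values from the ncdump output above (a small sketch; the numbers are copied from that output):
import numpy as np

scale_factor = 1.16753614203674e-05
add_offset = 278.297319296597

# Raw packed shorts copied from the ncdump output above
packed = np.array([-22772, 16038, 18213, 32767, -32766, 9725, 5877, 23442], dtype=np.int16)

# Unpacking as done on decode
unpacked = packed * scale_factor + add_offset
print(unpacked.reshape(2, 2, 2))
# matches the values shown by xarray above to within float32 precision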
When you save this dataset with xarray using to_netcdf(), it "packs" the values again using scale_factor and add_offset, but this time the other way round.
Let's invert the equation above, apply it to the values and see what happens:
>>> scale_factor = 1.16753614203674e-05
>>> add_offset = 278.297319296597
>>> (ds["t2m"].values-add_offset)/scale_factor
array([[[-22771.812, 16038.549],
[ 18213.268, 32767.152]],
[[-32767.152, 9726.115],
[ 5875.922, 23440.955]]], dtype=float32)
The problem
As you can see, the numbers changed slightly compared to the output from ncdump above. This has to do with floating point precision: the dtype of the array is still np.float32. When we save the dataset, t2m is of course stored again as type short.
This is usually not a problem, because unpacking these values again only leads to differences in t2m at decimal places we don't care about. In fact, we do not even notice them, since xarray shows just the first 4 or 5 decimals.
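As a small illustration of how harmless this normally is, here is a sketch that packs and unpacks one unaffected value by hand (assuming the cast to short behaves like numpy's astype):
import numpy as np

scale_factor = 1.16753614203674e-05
add_offset = 278.297319296597

value = np.array([278.03146], dtype=np.float32)   # an unaffected entry

packed = ((value - add_offset) / scale_factor).astype(np.int16)  # pack as on save
roundtrip = packed * scale_factor + add_offset                   # unpack as on load

print(packed)                  # within one step of the original -22772
print(abs(roundtrip - value))  # tiny difference, well below the 4-5 decimals xarray displays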
Now, why is this specific value NaN? This has to do with the _FillValue and missing_value attributes.
For t2m, the _FillValue is set to -32767. If you look at the calculation above, this is exactly the value that xarray's .to_netcdf() packs for the affected entry: -32767.152 becomes the short -32767, which on the next load is treated as a fill value and decoded as NaN.
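A minimal sketch of that collision, using the affected value from above (assuming the cast to short behaves like numpy's astype here):
import numpy as np

scale_factor = 1.16753614203674e-05
add_offset = 278.297319296597
fill_value = -32767

# The affected entry as held in memory as float32 (from the output above)
value = np.array([277.91476], dtype=np.float32)

repacked = (value - add_offset) / scale_factor   # float32 arithmetic: about -32767.15
packed_short = repacked.astype(np.int16)         # stored as exactly -32767

print(repacked)
print(packed_short)
print(packed_short[0] == fill_value)  # True -> decoded as NaN on the next load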
A workaround
If we convert the data to float64, we increase the precision of the data and prevent the unwanted behaviour. However, this has the downside that we lose the encoding of "t2m", including scale_factor and add_offset. This is undesirable, as we lose the compression that the packed short representation provided.
>>> ds["t2m"] = ds["t2m"].astype(np.float64)
>>> ds["t2m"]
array([[[278.0314636230469, 278.4845886230469],
[278.5099792480469, 278.6799011230469]],
[[277.9147644042969, 278.410888671875 ],
[278.3659362792969, 278.5710144042969]]])
>>> ds.to_netcdf("stable-datafile.nc")
ncdump output:
t2m =
278.031463623047, 278.484588623047,
278.509979248047, 278.679901123047,
277.914764404297, 278.410888671875,
278.365936279297, 278.571014404297 ;
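To double-check from Python, we can reload the file and confirm that no NaN values appear anymore (a small sketch):
import numpy as np
import xarray as xr

reloaded = xr.open_dataset("stable-datafile.nc")
# The affected entry now survives the round trip
print(np.isnan(reloaded["t2m"].values).any())  # False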
There is an ongoing issue in the xarray repository, so this problem might have a better solution in the future.