I am accessing a NetCDF file using the xarray Python library. The specific file that I am using is publicly available.
So, the file has several variables, and for most of these variables the dimensions are: time: 4314, x: 700, y: 562. I am using the ET_500m variable, but the behaviour is similar for the other variables as well. The chunking is: 288, 36, 44.
I am retrieving a single cell and printing the value using the following code:
import xarray as xr
ds = xr.open_dataset('./dataset_greece.nc')
print(ds.ET_500m.values[0][0][0])
According to my understanding, xarray should locate the chunk that contains the requested value directly on disk and read only that chunk. Since a chunk should be no bigger than a couple of MB, I would expect this to take a few seconds or even less. Instead, it takes more than 2 minutes.
If, in the same script, I retrieve the value of another cell, even if it is located in a different chunk (e.g. print(ds.ET_500m.values[1000][500][500])), then this second retrieval takes only some milliseconds.
So my question is what exactly causes this overhead in the first retrieval?
EDIT: I just saw that in xarray open_dataset there is the optional parameter cache, which according to the manual:
If True, cache data loaded from the underlying datastore in memory as NumPy arrays when accessed to avoid reading from the underlying datastore multiple times. Defaults to True [...]
So, when I set this to False, subsequent fetches are also slow like the first one. But my question remains: why is this so slow when I am only accessing a single cell? I was expecting xarray to locate the chunk on disk directly and read only a couple of MB.
Rather than selecting from the .values property, subset the array first:
print(ds.ET_500m[0, 0, 0].values)
The problem is that .values coerces the data to a numpy array, so you're loading all of the data and then subsetting the array. There's no way around this for xarray - numpy doesn't have any concept of lazy loading, so as soon as you call .values xarray has no option but to load (or compute) all of your data.
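To get a feel for how much data "all of your data" is, here is a rough back-of-the-envelope calculation using the dimensions and chunk shape from the question. The 4 bytes per value is an assumption (the question does not state the in-memory dtype), so treat the output as an order-of-magnitude estimate only.

```python
from math import prod

shape = (4314, 700, 562)   # time, x, y (from the question)
chunk = (288, 36, 44)      # chunk shape (from the question)
itemsize = 4               # ASSUMPTION: 4 bytes per value, e.g. float32

# Uncompressed in-memory size of one chunk versus the whole variable.
chunk_bytes = prod(chunk) * itemsize
total_bytes = prod(shape) * itemsize

print(f"one chunk:      {chunk_bytes / 2**20:.2f} MiB")
print(f"whole variable: {total_bytes / 2**30:.2f} GiB")
```

Under that assumption a single chunk is under 2 MiB, while the full variable is several GiB, which is consistent with the first access taking minutes rather than milliseconds.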
If the data is a dask-backed array, you could use .data rather than .values to access the dask array and use positional indexing on the dask array, e.g. ds.ET_500m.data[0, 0, 0]. But if the data is just a lazy-loaded netCDF variable, .data will have the same load-everything pitfall described above.
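The difference between the two access orders can be sketched with a small toy model. This is not xarray's actual implementation: FakeDisk and LazyVariable are hypothetical names invented for illustration, and the toy counts individual cells, whereas a real chunked netCDF read would fetch at least one whole chunk.

```python
class FakeDisk:
    """Hypothetical stand-in for a file on disk; counts how many cells are read."""

    def __init__(self, shape):
        self.shape = shape
        self.cells_read = 0

    @staticmethod
    def _cell(t, x, y):
        # Deterministic fake value for the cell at (t, x, y).
        return t * 100 + x * 10 + y

    def read(self, index=None):
        """Read one cell if an index is given, otherwise the whole array."""
        nt, nx, ny = self.shape
        if index is not None:
            self.cells_read += 1
            return self._cell(*index)
        self.cells_read += nt * nx * ny
        return [[[self._cell(t, x, y) for y in range(ny)]
                 for x in range(nx)] for t in range(nt)]


class LazyVariable:
    """Toy lazy variable: indexing only records the selection, .values reads."""

    def __init__(self, disk, index=None):
        self.disk = disk
        self.index = index

    def __getitem__(self, index):
        # Nothing is read here; the selection is just remembered.
        return LazyVariable(self.disk, index)

    @property
    def values(self):
        # Reading happens only now, limited to the remembered selection (if any).
        return self.disk.read(self.index)


disk = FakeDisk(shape=(4, 3, 2))
var = LazyVariable(disk)

# Pattern from the question: load everything, then index the in-memory result.
a = var.values[0][0][0]
read_values_first = disk.cells_read

# Pattern from the answer: index the lazy object, then load only that selection.
disk.cells_read = 0
b = var[0, 0, 0].values
read_index_first = disk.cells_read

print(f".values first: {read_values_first} cells read")
print(f"index first:   {read_index_first} cells read")
```

Both patterns return the same value; only the amount of data pulled from the (fake) disk differs.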