python · dask · python-xarray · dask-delayed

Can you use xarray inside of dask.delayed functions?


I want to use dask.delayed in my work. Some of the functions I want to apply it to work with xarray-based data. I know xarray itself can use dask under the hood, and for dask itself it is recommended that you don't pass dask arrays into delayed functions. I want to make sure I don't hit any similar pitfalls by passing xarray objects to delayed functions.

My current assumption is that I should avoid creating dask-backed xarray datasets, which I think I can achieve by avoiding the chunks keyword and not using open_mfdataset. My question is whether this restriction is necessary and sufficient to avoid problems. For reference, a sketch of how I would check which case I'm in is shown below.
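
This is roughly how I would check whether a dataset ends up dask-backed or numpy-backed (just a sketch; 'myfile.nc' and 'some_var' are placeholder names):

import dask.array
import xarray as xr

ds_np = xr.open_dataset('myfile.nc')             # numpy-backed by default
ds_dk = xr.open_dataset('myfile.nc', chunks={})  # dask-backed

# a numpy-backed variable reports no chunks; a dask-backed one wraps a dask array
print(ds_np['some_var'].chunks)                               # None
print(isinstance(ds_dk['some_var'].data, dask.array.Array))   # True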

To give a concrete (but heavily simplified) example, the kind of code I'm thinking of writing looks like this. The idea is that dask.delayed handles the parallelism, not xarray.

import dask
import xarray as xr

@dask.delayed
def pow2(ds):
    return ds**2

# build the graph lazily: open the file and square it inside delayed tasks
delayed_ds = dask.delayed(xr.open_dataset)('myfile.nc')
delayed_ds_squared = pow2(delayed_ds)

# dask.compute returns a tuple of results
ds_squared, = dask.compute(delayed_ds_squared)

Solution

  • As long as your xarray dataset isn't itself backed by dask, it can safely be passed as a parameter to a dask.delayed function.

    The only pitfall is the scheduler that compute() ends up using. Unless a distributed.Client has been instantiated, compute() on a dask.delayed object runs on one of the local schedulers, and you want it to be the threaded one rather than the multiprocessing one: xarray data, being numpy-based, largely releases the GIL, so it benefits heavily from multithreading, whereas the multiprocessing scheduler has to pickle the datasets back and forth between processes. Either use dask.distributed or explicitly pass compute(scheduler="threads").
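
    A minimal sketch of both options, reusing the hypothetical delayed_ds_squared object from the question:

    import dask
    from dask.distributed import Client

    # option 1: explicitly request the threaded scheduler for this compute call
    ds_squared, = dask.compute(delayed_ds_squared, scheduler="threads")

    # option 2: instantiate a distributed client; once created, it becomes
    # the default scheduler for subsequent compute() calls
    client = Client()   # starts a local cluster
    ds_squared, = dask.compute(delayed_ds_squared)
    client.close()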