I would like to initialize a very large xarray dataset (as on-disk Zarr, if possible) for later processing; various parts (spatial subsets) of the dataset will be populated by a different script.
The naive approach below won't work, because the dataset obviously doesn't fit into memory:
import numpy as np
import xarray as xr
xr_lons = xr.DataArray(np.arange(-180, 180, 0.001), dims=['x'], name='lons')
xr_lats = xr.DataArray(np.arange(90, -90, -0.001), dims=['y'], name='lats')
xr_da = xr.DataArray(0, dims=['y', 'x'], coords=[xr_lats, xr_lons])
xr_ds = xr.Dataset({"test": xr_da})
xr_ds.to_zarr("test.zarr", mode="w")
MemoryError: Unable to allocate 483. GiB for an array with shape (180000, 360000) and data type int64
What would be a good alternative?
I'm looking for a solution like this using plain zarr:
import zarr

nrows, ncols = 180_000, 360_000  # the grid size from the error above
root = zarr.open('example.zarr', mode='w')
mosaics = root.create_group('mosaics')
dsm = mosaics.create_dataset('dsm', shape=(nrows, ncols), chunks=(1024, 1024), dtype='i4')
This dataset, however, doesn't conform to the normal xarray structure and metadata, so I'm looking for a solution that uses xarray (or Dask?) directly.
You can use dask to "initialize" an xarray Dataset backed by a lazy dask array, and then write it to disk in chunks concurrently.
import dask.array as da
import numpy as np
import xarray as xr

lat = np.arange(90, -90, -0.001)
lon = np.arange(-180, 180, 0.001)

x = xr.Dataset(
    coords={
        "lat": (["lat"], lat),
        "lon": (["lon"], lon),
    },
    data_vars={
        "dsm": (
            ["lat", "lon"],
            # Lazy dask array of zeros: nothing is allocated until chunks are computed.
            da.zeros((lat.size, lon.size), chunks=(1024, 1024), dtype="uint8"),
        )
    },
)
This gives you an xarray Dataset with a lazy dask array dsm.
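You can sanity-check that nothing has been materialized yet, for instance (a quick check; the exact output depends on your versions):

print(type(x.dsm.data))      # <class 'dask.array.core.Array'>
print(x.dsm.nbytes / 2**30)  # ~60.35, the logical size in GiB; no memory is allocated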
You can now apply any further modifications, or simply write it to disk using .to_zarr.
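If the goal is the workflow from the question, i.e. create the store now and populate spatial subsets later from other scripts, one pattern is to write only the metadata up front with compute=False and then fill individual regions with the region argument. This is a sketch using xarray's to_zarr keywords; the slice bounds are made-up examples and should align with the chunk grid to allow safe concurrent writes:

import numpy as np
import xarray as xr

# Write the store layout (metadata and coordinate arrays) without computing the zeros.
x.to_zarr("test.zarr", mode="w", compute=False)

# Later, possibly from a different script: overwrite one spatial subset in place.
subset = xr.Dataset(
    {"dsm": (["lat", "lon"], np.ones((1024, 1024), dtype="uint8"))}
)
subset.to_zarr(
    "test.zarr",
    mode="r+",
    region={"lat": slice(0, 1024), "lon": slice(0, 1024)},
)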
A couple of notes:
Your array is fairly large, so choosing the right datatype is crucial. If you don't specify one, dask defaults to float64, which comes to 483 GiB for your array. If you instead specify e.g. uint8 as in my example, the same array is only 60.35 GiB.
Another crucial factor is the chunk size. The (1024, 1024) from your example yields very small chunks: with uint8 that is roughly 62,000 chunks of about 1 MiB each, which means a huge number of tasks and will likely cause issues for either dask or your disk I/O. Picking larger chunks keeps the task count manageable, as the quick check below illustrates.
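A back-of-the-envelope comparison (the alternative chunk size of (4096, 4096) is just an illustrative choice, not a recommendation for your hardware):

import math

def n_chunks(shape, chunks):
    # Number of chunks needed to tile an array of `shape` with chunk size `chunks`.
    return math.prod(math.ceil(s / c) for s, c in zip(shape, chunks))

shape = (180_000, 360_000)
print(n_chunks(shape, (1024, 1024)))  # 61952 chunks, ~1 MiB each for uint8
print(n_chunks(shape, (4096, 4096)))  # 3872 chunks, ~16 MiB each for uint8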