I want to save a very large two-dimensional Zarr store, chunked equally along both dimensions (X, X), where some chunks occasionally consist entirely of NaNs. To reduce the number of chunks written to disk, I want xarray's to_zarr
method to skip writing those chunks entirely.
Here is some code to emulate it:
import numpy as np
import xarray as xr
n = 100 # this could get as large as 400K, leaving it small for simplicity
n_chunk = 50 # chunk size
n_delete = 1 # number of random chunks to change to nans
lat = np.linspace(1, 2, n)
lon = np.linspace(1, 2, n)
data = np.random.random((n, n))
all_c = [(i, j) for i in range(n // n_chunk) for j in range(n // n_chunk)]
# pick chunks to blank out (without replacement, so no duplicates)
delete = np.array(all_c)[np.random.choice(len(all_c), n_delete, replace=False)]
for i, j in delete:
    data[i * n_chunk:(i + 1) * n_chunk, j * n_chunk:(j + 1) * n_chunk] = np.nan
xarr = xr.DataArray(data=data, name="test", dims=["lat", "lon"], coords=dict(lat=lat, lon=lon))
xarr = xarr.chunk((n_chunk, n_chunk))
xarr.to_dataset().to_zarr(r"C:/experiment.zarr", mode="w", encoding={"test": {"_FillValue": None}})
This writes all the chunks (4 in the case above) to disk, since an all-NaN chunk is still valid float data. How can I stop it from writing the all-NaN chunks?
Xarray can pass through Zarr's write_empty_chunks
option. You can add it to your variable encoding:
ds.to_zarr(..., encoding={"test": {..., "write_empty_chunks": False}})
Note that Zarr only treats a chunk as empty when every element equals the array's fill value, so the fill value needs to be NaN for all-NaN chunks to be skipped; setting _FillValue to None as in your example prevents that.