I have been benchmarking how to store N-dimensional arrays with xarray using either the netCDF or the Zarr file formats as well as all the different encoding options provided with either file format. Zarr seems to generally outperform netCDF with my data and system, but I find something surprising with the resulting files.
The MWE below generates an xarray.Dataset of 16793600 bytes:
import numpy as np
import xarray as xr
import zarr
# Versions (Python 3.11.6)
print(np.__version__) # 1.26.0
print(xr.__version__) # 2023.10.1
print(zarr.__version__) # 2.16.1
rng = np.random.default_rng(0)
t = 10
ds = xr.Dataset(
    data_vars=dict(
        A=(["y", "x"], rng.normal(size=(2**t, 2**t))),
        B=(["y", "x"], rng.normal(size=(2**t, 2**t))),
    ),
    coords=dict(
        x=(["x"], np.linspace(0, 1, 2**t)),
        y=(["y"], np.linspace(0, 1, 2**t)),
    ),
)
print(f'{ds.nbytes}') # 16793600
# To netCDF, using different engines
ds.to_netcdf('file1.nc', engine='netcdf4', encoding=None)
ds.to_netcdf('file2.nc', engine='scipy', encoding=None)
ds.to_netcdf('file3.nc', engine='h5netcdf', encoding=None)
# To Zarr
ds.to_zarr('file.zarr', encoding=None)
yet the files generated are
$ du -bs file*
16801792 file1.nc
16793952 file2.nc
16801792 file3.nc
16017234 file.zarr
That is, the Zarr store is smaller, shrinking by almost 800 kB when written, even though encoding is set to None, which I understood as 'apply no compression'. This might seem like a minimal difference, but I am working with 38 GB xarray.Datasets. Using the same approach, with encoding=None, the netCDF files come out at 38 GB with the netcdf4 or h5netcdf engines (scipy fails for some reason), yet the Zarr stores are only 16 GB, less than half the size!
How is this possible if no encoding is specified? What is Zarr (or xarray) doing? If it is applying some compression, can I avoid it? I have also noticed that saving and reading these large Zarr stores takes less time, but uses more memory than the netCDF counterparts.
You should double-check the encoding of the netCDF and Zarr data after it has been saved. ncdump -hs will show you the netCDF encoding, and you can open the Zarr array metadata JSON file directly. Based on your description, I suspect both formats have some default compression that is being used.
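For example, here is a quick sketch of how to check this from Python, assuming the 'file1.nc' and 'file.zarr' paths from your question are in the working directory (the exact keys present in .encoding can vary with the xarray version):
import json
import xarray as xr
# Check the encoding xarray reports after a round trip through each format
with xr.open_zarr('file.zarr') as ds_zarr:
    print(ds_zarr['A'].encoding)   # look for a 'compressor' entry
with xr.open_dataset('file1.nc', engine='netcdf4') as ds_nc:
    print(ds_nc['A'].encoding)     # look for 'zlib' / 'complevel' entries
# Or read the zarr v2 array metadata JSON directly from the directory store
with open('file.zarr/A/.zarray') as f:
    print(json.load(f)['compressor'])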
If you want to force Zarr to omit its default compression, you need to set the compressor entry of the encoding argument to None for each array:
import json
store = {}  # in-memory store so the Zarr metadata can be inspected directly
ds.to_zarr(store, encoding={"A": {"compressor": None}, "B": {"compressor": None}})
json.loads(store['A/.zarray'])  # array metadata for variable A
# yields
{'chunks': [256, 256],
'compressor': None,
'dtype': '<f8',
'fill_value': 'NaN',
'filters': None,
'order': 'C',
'shape': [1024, 1024],
'zarr_format': 2}
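The same encoding argument works when writing to a directory on disk; as a sketch (using a hypothetical output path and looping over all data variables):
# Sketch: write an uncompressed Zarr store to disk (hypothetical file name)
ds.to_zarr(
    'file_uncompressed.zarr',
    encoding={var: {"compressor": None} for var in ds.data_vars},
)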