python, python-xarray, zarr

How can an xarray-generated Zarr file with no encoding use less disk space than its actual data size?


I have been benchmarking how to store N-dimensional arrays with xarray, using either the netCDF or the Zarr file format as well as the different encoding options each format provides. Zarr seems to generally outperform netCDF with my data and system, but I find something surprising about the resulting files.

The MWE below generates an xarray.Dataset of 16793600 bytes

import numpy as np
import xarray as xr
import zarr

# Versions (Python 3.11.6)
print(np.__version__)   # 1.26.0
print(xr.__version__)   # 2023.10.1
print(zarr.__version__) # 2.16.1

rng = np.random.default_rng(0)

t = 10
ds = xr.Dataset(
    data_vars=dict(
        A=(["y", "x"], rng.normal(size=(2**t,2**t))),
        B=(["y", "x"], rng.normal(size=(2**t,2**t))),
    ),
    coords=dict(
        x=(["x"], np.linspace(0, 1, 2**t)),
        y=(["y"], np.linspace(0, 1, 2**t)),
    ),
)
print(f'{ds.nbytes}') # 16793600

# To netCDF, using different engines
ds.to_netcdf('file1.nc', engine='netcdf4', encoding=None)
ds.to_netcdf('file2.nc', engine='scipy', encoding=None)
ds.to_netcdf('file3.nc', engine='h5netcdf', encoding=None)

# To Zarr
ds.to_zarr('file.zarr', encoding=None)

yet the files generated are

$ du -bs file*
16801792        file1.nc
16793952        file2.nc
16801792        file3.nc
16017234        file.zarr

That is, the Zarr store is the smallest, saving almost 800 kB on disk, even though encoding is set to None, which I understand as 'use no compression'. This might seem a minimal difference, but I am working with 38 GB xarray.Datasets. Using the same approach with encoding=None, the netCDF files come out at 38 GB with the netcdf4 or h5netcdf engines (scipy fails for some reason), yet the Zarr store is only 16 GB, less than half the size!

How is this possible if no encoding is specified? What is Zarr (or xarray) doing? If it is using any compression, can I avoid it? I have also noticed that saving and reading these large Zarr files takes less time but uses more memory than the netCDF counterparts.


Solution

  • You should double-check the encoding of the netCDF and Zarr data after it has been saved. ncdump -hs will show you the netCDF encoding, and you can open the Zarr array metadata JSON file directly (a short inspection sketch follows the metadata dump below). Based on your description, I suspect both formats have some default compression that is being used.

    If you want to force Zarr to omit its default compression, you need to set the compressor encoding argument to None for each array:

    import json

    store = {}  # in-memory store: a plain dict works as a Zarr MutableMapping
    ds.to_zarr(store, encoding={"A": {"compressor": None}, "B": {"compressor": None}})
    json.loads(store['A/.zarray'])
    
    # yields
    {'chunks': [256, 256],
     'compressor': None,
     'dtype': '<f8',
     'fill_value': 'NaN',
     'filters': None,
     'order': 'C',
     'shape': [1024, 1024],
     'zarr_format': 2}
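
    For reference, here is a minimal sketch of both checks on an on-disk store, assuming the file.zarr written by the MWE above and a zarr 2.x install whose default compressor is Blosc/LZ4 (the exact codec config shown in the comment may differ with your versions; no_compression and file_uncompressed.zarr are just illustrative names):

    import json

    # Inspect the compressor that zarr chose by default for variable A.
    with open('file.zarr/A/.zarray') as f:
        print(json.load(f)['compressor'])
    # typically something like:
    # {'blocksize': 0, 'clevel': 5, 'cname': 'lz4', 'id': 'blosc', 'shuffle': 1}

    # Write a truly uncompressed store by disabling the compressor
    # for every variable (data variables and coordinates alike).
    no_compression = {name: {"compressor": None} for name in ds.variables}
    ds.to_zarr('file_uncompressed.zarr', encoding=no_compression)

    The uncompressed store should then come out at roughly ds.nbytes on disk, plus a small overhead for the JSON metadata files, in line with the netCDF sizes above.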