python-xarray netcdf zarr

Do zarr arrays natively support integer scaling and offsets like NetCDF? If not, is there a workaround?


I have a bunch of NetCDF (.nc) files (ERA5 dataset) that I'm reading in Python through xarray and rioxarray. They end up as arrays of float32 (4 bytes) in memory.

However, on disk they are stored as short (2 bytes):

$ ncdump -h file.nc
...
    short u100(time, latitude, longitude) ;
        u100:scale_factor = 0.000895262699529722 ;
        u100:add_offset = 2.29252111865024 ;
        u100:_FillValue = -32767s ;
        u100:missing_value = -32767s ;
...

Apparently xarray automatically applies the offset and scale factor to convert these integers back into floats while reading the NetCDF file.
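
For reference, my understanding is that this decoding follows the CF packing convention: unpacked = packed * scale_factor + add_offset, with the fill value mapped to NaN. Below is a minimal pure-Python sketch of that arithmetic using the attribute values from the header above. The function names `unpack` and `pack` are made up for illustration; xarray's actual implementation works on whole arrays and may differ in details.

```python
import math

# Attribute values copied from the ncdump header above.
SCALE_FACTOR = 0.000895262699529722
ADD_OFFSET = 2.29252111865024
FILL_VALUE = -32767


def unpack(packed: int) -> float:
    """Decode one stored short (hypothetical helper, CF packing convention)."""
    if packed == FILL_VALUE:
        return math.nan  # missing data becomes NaN
    return packed * SCALE_FACTOR + ADD_OFFSET


def pack(value: float) -> int:
    """Inverse of unpack: quantize a float back to the stored short."""
    if math.isnan(value):
        return FILL_VALUE
    return round((value - ADD_OFFSET) / SCALE_FACTOR)
```

Every decoded value is one of at most 65536 evenly spaced levels, which is why storing the result as float32 adds size without adding information.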

Now I'm rechunking these and storing them as zarr, so I can efficiently access entire time series at a single geographical location. However, the zarr files end up at almost twice the size of the original NetCDFs, because the data remain stored as floats. Because it's about a terabyte in its original form, bandwidth and storage considerations are important, so I'd like to make this smaller. And we're not gaining anything by this additional storage size; the incoming data only had 16 bits of precision to begin with.

I know I could just manually convert the data back to shorts on the way into zarr, and back to floats on the way out of zarr, but that's tedious and error-prone (even when it happens automatically).

Is there a way to do this transparently, the way it seems to happen with NetCDF?


Solution

  • The drawback of the xarray approach is that the data must always be opened with xarray. Instead, you can use the Zarr codec FixedScaleOffset via numcodecs.zarr3:

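    import xarray as xr
    import numcodecs.zarr3

    # Small float32 test dataset
    ds = xr.Dataset(data_vars={'var': xr.DataArray([0.15, 1.15, 2.30])}).astype('float32')

    # Stored value = round((x - offset) * scale), kept on disk as int16.
    # Named scale_offset rather than filter to avoid shadowing the builtin.
    scale_offset = numcodecs.zarr3.FixedScaleOffset(
        offset=0,
        scale=100,
        dtype='float32',
        astype='int16',
    )
    encoding = {'var': {'filters': [scale_offset]}}

    ds.to_zarr('/tmp/test.zarr', encoding=encoding, mode='w', zarr_format=3)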
    

    If you encounter the recent zarr-python bug "Expected a ArrayArrayCodec", pin the dependency to zarr-python=3.0.*.

    For Zarr format v2, see zarr.codecs.FixedScaleOffset.
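
To carry over the packing from the question's NetCDF files, note that FixedScaleOffset encodes as round((x - offset) * scale) and decodes as q / scale + offset, so its scale is the reciprocal of the CF scale_factor, while offset corresponds to add_offset. The following is a rough pure-Python sketch of that mapping, not the codec itself: `encode` and `decode` are illustrative stand-ins for what the codec does per element, and fill values are not handled here.

```python
# CF attributes from the question's ncdump header.
scale_factor = 0.000895262699529722
add_offset = 2.29252111865024

# FixedScaleOffset multiplies on encode, so its scale is the reciprocal
# of the CF scale_factor; the offset carries over unchanged.
codec_scale = 1.0 / scale_factor
codec_offset = add_offset


def encode(x: float) -> int:
    """Illustrative stand-in for the codec's per-element encode step."""
    return round((x - codec_offset) * codec_scale)


def decode(q: int) -> float:
    """Illustrative stand-in for the codec's per-element decode step."""
    return q / codec_scale + codec_offset


value = 5.0
stored = encode(value)      # integer that would be written as int16
restored = decode(stored)   # float recovered on read
```

With these parameters the round-trip error is at most half a quantization step (scale_factor / 2), matching the 16-bit precision of the original files.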