dask python-xarray zarr

Adding a new Xarray DataArray to an existing Zarr store without rewriting the whole dataset?


How do I add a new DataArray to an existing Dataset without overwriting the whole thing? The new DataArray shares some coordinates with the existing one, but also has new ones. In my current implementation, the Dataset gets completely overwritten, instead of just adding the new stuff.

The existing DataArray is chunked and stored in a zarr DirectoryStore (though I have the same problem with an S3 store).

import numpy as np
import xarray as xr
import zarr

arr1 = xr.DataArray(np.random.randn(2, 3),
                   [('x', ['a', 'b']), ('y', [10, 20, 30])],
                   name='arr1')

ds = arr1.chunk({'x': 1, 'y': 3}).to_dataset()

ds looks like this:

<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30
Data variables:
    arr1     (x, y) float64 dask.array<shape=(2, 3), chunksize=(1, 3)>

I write it to a directory store:

store = zarr.DirectoryStore('test.zarr')
z = ds.to_zarr(store, group='arr', mode='w')

It looks good:

$ ls -l test.zarr/arr
total 0
drwxr-xr-x  6 myuser  mygroup  204 Sep 21 11:03 arr1
drwxr-xr-x  5 myuser  mygroup  170 Sep 21 11:03 x
drwxr-xr-x  5 myuser  mygroup  170 Sep 21 11:03 y

I create a new DataArray that shares some coordinates with the existing one, and add it to the existing Dataset. I'll read the existing Dataset first, since that's what I'm doing in practice.

ds2 = xr.open_zarr(store, group='arr')
arr2 = xr.DataArray(np.random.randn(2, 3),
                   [('x', arr1.x), ('z', [1, 2, 3])],
                   name='arr2')
ds2['arr2'] = arr2

The updated Dataset looks fine:

<xarray.Dataset>
Dimensions:  (x: 2, y: 3, z: 3)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 10 20 30
  * z        (z) int64 1 2 3
Data variables:
    arr1     (x, y) float64 dask.array<shape=(2, 3), chunksize=(1, 3)>
    arr2     (x, z) float64 0.4728 1.118 0.7275 0.4971 -0.3398 -0.3846

...but I can't write to it without a complete overwrite.

# I think I'm "appending" to the group `arr`
z2 = ds2.to_zarr(store, group='arr', mode='a')

This gives me a ValueError: The only supported options for mode are 'w' and 'w-'.

# I think I'm "creating" the new arr2 array in the arr group
z2 = ds2.to_zarr(store, group='arr', mode='w-')

This gives me ValueError: path 'arr' contains a group.

The only thing that worked is z2 = ds2.to_zarr(store, group='arr', mode='w'), but this completely overwrites the group.

The original DataArray is actually quite large in my problem, so I really don't want to re-write it. Is there a way to only write the new DataArray?

Thank you!


Solution

  • The existing answers are out of date: Dataset.to_zarr in xarray now supports mode="a". See the documentation:

    Xarray supports several ways of incrementally writing variables to a Zarr store. These options are useful for scenarios when it is infeasible or undesirable to write your entire dataset at once.

    1. Use mode='a' to add or overwrite entire variables,
    2. Use append_dim to resize and append to existing variables, and
    3. Use region to write to limited regions of existing arrays.