pythonpython-xarrayzarr

printing the uploading progress using `xarray.Dataset.to_zarr` function


I'm trying to upload an xarray dataset to GCP using the function ds.to_zarr(store=store), and it works perfect. However, I would like to show the progress of big datasets. Is there any option to chunk my dataset in a way I can use tqdm or someting like that to log the uploading progress?

Here is the code that I currently have:

import os

import xarray as xr
import numpy as np
import gcsfs
from dask.diagnostics import ProgressBar

if __name__ == '__main__':
    # for testing
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account.json"

    # create xarray
    data_arr = np.random.rand(5000, 100, 100)
    data_xarr = xr.DataArray(data_arr,
                             dims=["x", "y", "z"])

    # define store
    gcp_blob_uri = "gs://gprlib/test.zarr"
    gcs = gcsfs.GCSFileSystem()
    store = gcs.get_mapper(gcp_blob_uri)

    # delayed to_zarr computation -> seems that it does not work
    write_job = data_xarr\
        .to_dataset(name="data")\
        .to_zarr(store, mode="w", compute=False)

    print(write_job)

Solution

  • xarray.Dataset.to_zarr has an optional argument compute which is True by default:

    compute (bool, optional) – If True write array data immediately, otherwise return a dask.delayed.Delayed object that can be computed to write array data later. Metadata is always updated eagerly.

    Using this, you can track the progress using dask's own dask.distributed.progress bar:

    write_job = ds.to_zarr(store, compute=False)
    write_job = write_job.persist()
    
    # this will return an interactive (non-blocking) widget if in a notebook
    # environment. To force the widget to block, provide notebook=False.
    distributed.progress(write_job, notebook=False)
    [##############                          ] | 35% Completed |  4.5s
    

    Note that for this to work, the dataset must consist of chunked dask arrays. If the data is in memory, you could use a single chunk per array with ds.chunk().to_zarr.