I'm trying to upload an xarray dataset to GCP using ds.to_zarr(store=store), and it works perfectly. However, I would like to show the upload progress for big datasets. Is there any option to chunk my dataset in a way that lets me use tqdm or something like that to log the upload progress?
Here is the code that I currently have:
import os

import xarray as xr
import numpy as np
import gcsfs
from dask.diagnostics import ProgressBar

if __name__ == '__main__':
    # for testing
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "service-account.json"

    # create xarray
    data_arr = np.random.rand(5000, 100, 100)
    data_xarr = xr.DataArray(data_arr,
                             dims=["x", "y", "z"])

    # define store
    gcp_blob_uri = "gs://gprlib/test.zarr"
    gcs = gcsfs.GCSFileSystem()
    store = gcs.get_mapper(gcp_blob_uri)

    # delayed to_zarr computation -> seems that it does not work
    write_job = data_xarr\
        .to_dataset(name="data")\
        .to_zarr(store, mode="w", compute=False)

    print(write_job)
xarray.Dataset.to_zarr has an optional argument compute which is True by default:

    compute (bool, optional) – If True write array data immediately, otherwise return a dask.delayed.Delayed object that can be computed to write array data later. Metadata is always updated eagerly.

Using this, you can track the progress using dask's own dask.distributed.progress bar:
from dask.distributed import Client, progress

client = Client()  # progress() requires a dask.distributed scheduler

write_job = ds.to_zarr(store, compute=False)
write_job = write_job.persist()

# this will return an interactive (non-blocking) widget if in a notebook
# environment. To force the widget to block, provide notebook=False.
progress(write_job, notebook=False)
[############## ] | 35% Completed | 4.5s
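If you are not running a distributed cluster, the callbacks for dask's local scheduler can be used instead. This is a minimal sketch using dask.diagnostics.ProgressBar (which the question already imports); the tqdm variant assumes a reasonably recent tqdm release that ships tqdm.dask.TqdmCallback. As with the distributed approach, ds must hold chunked dask arrays (see the note below):

from dask.diagnostics import ProgressBar
# from tqdm.dask import TqdmCallback   # alternative: tqdm-based dask callback

write_job = ds.to_zarr(store, mode="w", compute=False)

# dask's built-in text progress bar for the local scheduler
with ProgressBar():
    write_job.compute()

# the tqdm variant would look like this instead:
# with TqdmCallback(desc="uploading to GCS"):
#     write_job.compute()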
Note that for this to work, the dataset must consist of chunked dask arrays. If the data is in memory, you could use a single chunk per array with ds.chunk().to_zarr.
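For example, the in-memory DataArray from the question could be given dask chunks before the write (the chunk size along "x" below is an arbitrary choice for illustration, not a recommendation):

# Sketch: turn the in-memory array into a chunked (dask-backed) dataset
# before writing, so the write is delayed and its progress can be tracked.
ds = data_xarr.to_dataset(name="data").chunk({"x": 500})  # arbitrary chunk size

write_job = ds.to_zarr(store, mode="w", compute=False)
# ...then monitor write_job with distributed.progress or ProgressBar as above.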