I have a zarr file that I'd like to convert to netCDF, but the dataset is too large to fit in memory. My computer has 32 GB of RAM, so writing ~5.5 GB chunks shouldn't be a problem. However, within seconds of running the script below, memory usage tops out at the available ~20 GB and the script fails.
Data: Dropbox link to a zarr file of radar rainfall data for 6/28/2014 over the United States, around 1.8 GB in total.
Code:
import xarray as xr
import zarr

fpath_zarr = "out_zarr_20140628.zarr"

# open the zarr store lazily, one chunk per 30 time steps over the full spatial grid
ds_from_zarr = xr.open_zarr(store=fpath_zarr, chunks={'outlat': 3500, 'outlon': 7000, 'time': 30})

# write to netCDF, compressing the rainrate variable with zlib
ds_from_zarr.to_netcdf("ds_zarr_to_nc.nc", encoding={"rainrate": {"zlib": True}})
Output:
MemoryError: Unable to allocate 5.48 GiB for an array with shape (30, 3500, 7000) and data type float64
Package versions:
dask 2022.7.0
xarray 2022.3.0
zarr 2.8.1
See the dask docs on Best Practices with Dask Arrays. The section on "Select a good chunk size" reads:
A common performance problem among Dask Array users is that they have chosen a chunk size that is either too small (leading to lots of overhead) or poorly aligned with their data (leading to inefficient reading).
While optimal sizes and shapes are highly problem specific, it is rare to see chunk sizes below 100 MB in size. If you are dealing with float64 data then this is around (4000, 4000) in size for a 2D array or (100, 400, 400) for a 3D array.
You want to choose a chunk size that is large in order to reduce the number of chunks that Dask has to think about (which affects overhead) but also small enough so that many of them can fit in memory at once. Dask will often have as many chunks in memory as twice the number of active threads.
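You can see what chunking dask actually ends up with by inspecting the dataset before writing anything. A quick sketch, reusing the store path and the rainrate variable from your snippet:

import xarray as xr

# open lazily with the same chunks as in the question
ds = xr.open_zarr("out_zarr_20140628.zarr", chunks={'outlat': 3500, 'outlon': 7000, 'time': 30})

print(ds.chunks)                      # chunk lengths per dimension
print(ds["rainrate"].data.chunksize)  # shape of a single chunk, e.g. (30, 3500, 7000)
print(ds["rainrate"].data.numblocks)  # number of chunks along each dimension
print(ds["rainrate"].dtype)           # float64, i.e. 8 bytes per value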
I imagine the issue here is that 5.48 GiB * n_workers * 2 far exceeds your available 32 GB of RAM, so at any given point in time one of your workers runs out of memory and fails, and dask then treats the whole job as failed.
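A back-of-the-envelope check lines up with the traceback (the thread count below is just an illustration, not something I know about your setup):

# one chunk of float64: 30 * 3500 * 7000 values at 8 bytes each
nbytes_per_chunk = 30 * 3500 * 7000 * 8
print(nbytes_per_chunk / 2**30)  # ~5.48 GiB, matching the allocation in the MemoryError

# with e.g. 4 threads and up to ~2 chunks per thread in flight,
# dask may try to hold roughly 4 * 2 * 5.48 GiB ~= 44 GiB at once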
The best way to get around this is to reduce your chunk size. As the docs note, the best chunking strategy depends on your workflow, cluster setup, and hardware; that said, in my experience it's best to keep the number of tasks under 1 million and chunk sizes in the ~150 MB to 1 GB range. See the sketch below for one way to do that here.
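A minimal sketch of what that could look like for your script, keeping the time chunking but splitting the spatial dimensions. The 875 x 875 numbers are purely illustrative (they give ~175 MB chunks); ideally pick values aligned with the zarr store's native chunking, which xarray usually records in ds_from_zarr["rainrate"].encoding.

import xarray as xr

fpath_zarr = "out_zarr_20140628.zarr"

# ~30 * 875 * 875 float64 values per chunk ~= 175 MB, comfortably within available RAM
ds = xr.open_zarr(store=fpath_zarr, chunks={'time': 30, 'outlat': 875, 'outlon': 875})

# xarray/dask will now stream the data chunk by chunk into the netCDF file
ds.to_netcdf("ds_zarr_to_nc.nc", encoding={"rainrate": {"zlib": True}})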