python, dask, python-xarray, netcdf, netcdf4

Dask vs simple sequential app: writing unbounded data to a single NetCDF


I need to export all of my unbounded (growing, currently 1 TB) data to a single NetCDF4 file.

The full ETL consists of several steps; the last one (writing to a single NetCDF file) is what I'm focusing on here.

My needs:

There was a related discussion on this back in 2018: https://github.com/dask/distributed/issues/2163

Question

As of 2024, are there any benefits (increased parallelism? chunk-by-chunk writing and thus lower memory consumption?) to using Dask to write data to a single NetCDF file, compared with a simple (single-threaded) sequential Python app using standard NetCDF libraries?


Solution

  • The standard dask.array function to_hdf5 will create and fill an array in your target file. HDF5 supports creating a dataset up front and filling in its chunks as you go. If you can phrase your procedure as a set of dask.array operations (e.g., read from files, where each file becomes a chunk of the array), then this is all you need to get chunkwise writing and low memory usage: https://docs.dask.org/en/latest/generated/dask.array.to_hdf5.html
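
    For instance, a minimal sketch of that pattern might look like the following, where the file names, the load_block helper, and the block shape are all hypothetical placeholders:

    ```python
    import dask
    import dask.array as da
    import numpy as np

    # Hypothetical loader: each source file yields one equally shaped block.
    def load_block(path):
        return np.load(path)  # assumption: .npy files of shape (1000, 1000)

    paths = ["part-000.npy", "part-001.npy", "part-002.npy"]  # placeholders
    blocks = [
        da.from_delayed(dask.delayed(load_block)(p),
                        shape=(1000, 1000), dtype="f8")
        for p in paths
    ]
    x = da.concatenate(blocks, axis=0)

    # Creates /data in output.h5 and streams the array out chunk by chunk,
    # so peak memory stays near one chunk rather than the whole array.
    da.to_hdf5("output.h5", "/data", x)
    ```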

    In cases where you can't use dask.array (perhaps you are coming back in another session with more data), you can still open the file/variable in append mode and use dask.delayed to write each chunk, as in the sketch below. Note that the internal chunking of the array, along with other options such as compression, is fixed when the dataset is first created.
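
    A minimal sketch of that append pattern, assuming blocks of a fixed shape and a hypothetical load_block producer; since a single HDF5 file cannot be written concurrently, the delayed writes are run on the single-threaded scheduler:

    ```python
    import dask
    import h5py
    import numpy as np

    ROWS, COLS = 1_000, 1_000  # assumption: every block has this fixed shape

    def load_block(i):
        # Hypothetical stand-in for whatever produces the next chunk of data.
        return np.full((ROWS, COLS), float(i))

    def append_block(path, name, start, block):
        # Each task opens the file, grows the resizable dimension if needed,
        # writes its slab, and closes the file again.
        with h5py.File(path, "a") as f:
            dset = f[name]
            end = start + block.shape[0]
            if dset.shape[0] < end:
                dset.resize(end, axis=0)
            dset[start:end] = block

    # Chunking and compression are fixed here, at creation time.
    with h5py.File("output.h5", "w") as f:
        f.create_dataset("data", shape=(0, COLS), maxshape=(None, COLS),
                         chunks=(ROWS, COLS), compression="gzip", dtype="f8")

    tasks = [
        dask.delayed(append_block)("output.h5", "data", i * ROWS,
                                   dask.delayed(load_block)(i))
        for i in range(10)
    ]

    # HDF5 allows only one writer at a time, so serialize the writes.
    dask.compute(*tasks, scheduler="single-threaded")
    ```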