I've got a function that generates an image and stores it to disk. The function takes no arguments:
def generate_and_save():
    pass  # generate and store the image
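For the sake of a runnable example, imagine a stand-in like this (NumPy and Pillow are used purely for illustration; the real function is more involved):

import uuid
import numpy as np
from PIL import Image

def generate_and_save():
    # generate a random 256x256 RGB image and store it to disk
    pixels = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    Image.fromarray(pixels).save(f"image_{uuid.uuid4().hex}.png")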
I need to generate a large number of images (say 100k), so I opted for Dask. Having read through the docs, I've put together a function that creates a distributed (local) client and executes the tasks across several processes, like so:
from dask.distributed import Client as DaskClient

def generate_images(how_many=10000, processes=6):
    # start a local Dask distributed client with 1 thread per process
    client = DaskClient(n_workers=processes, threads_per_worker=1)
    # submit the futures to the cluster
    futures = []
    for i in range(how_many):
        futures.append(client.submit(generate_and_save, pure=False))
    # execute and gather the results (synchronous / blocking!)
    results = client.gather(futures)
    print(len(results))
    # stop & release the client
    client.close()

generate_images(50000)
As you can see, the futures are submitted to the scheduler in a for loop and stored in a plain list. The question is: is there a more efficient way of adding and executing the futures in this case? For example, by parallelizing the submission procedure itself?
Nope, this looks pretty good. I wouldn't expect the submission overhead to take too long, probably somewhere under 1 ms per task, so on the order of 10 seconds for the 10,000-task default (and proportionally more for 50,000).
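That said, if you want to trim the per-task submission cost, client.map submits a whole collection of tasks to the scheduler in one call rather than one round trip per future. A minimal sketch, assuming your generate_and_save is defined; the generate_one wrapper is mine (it just gives each task a distinct argument, since generate_and_save takes none):

from dask.distributed import Client as DaskClient

def generate_one(i):
    # the index argument only distinguishes the tasks; it is unused
    generate_and_save()

client = DaskClient(n_workers=6, threads_per_worker=1)
# pure=False matches your submit() calls and prevents deduplication
futures = client.map(generate_one, range(50000), pure=False)
client.gather(futures)
client.close()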
If this overhead does add up to a long time, then you might want to read this doc section: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-graphs
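The main recommendation there is to make each task do more work so the graph has fewer nodes. A sketch of that idea applied here, batching the calls (generate_batch and batch_size are illustrative names, and I assume how_many divides evenly by the batch size):

from dask.distributed import Client as DaskClient

def generate_batch(batch_size):
    # one future now produces a whole batch of images
    for _ in range(batch_size):
        generate_and_save()

def generate_images_batched(how_many=50000, processes=6, batch_size=100):
    client = DaskClient(n_workers=processes, threads_per_worker=1)
    # 500 futures instead of 50,000 keeps the task graph small
    futures = [client.submit(generate_batch, batch_size, pure=False)
               for _ in range(how_many // batch_size)]
    client.gather(futures)
    client.close()

generate_images_batched()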