pythonparallel-processingdaskdask-distributeddask-delayed

Dask run all combination of elements in different lists in parallel


I'm trying to run a function on different combination of all the elements in different arrays with dask, and I'm struggling to apply it.

The serial code is as below:

for i in range(5):
    for j in range(5):
        for k in range(5):
            function(listA[i],listB[j],listC[k])
            print(f'{i}.{j}.{k}')
            k=k+1
        j=j+1
    i=i+1

This code running time on my computer is 18 min, while each array has only 5 elements, i want to run this code parallel with dask on bigger size of arrays. All the calculations inside the function doesn't independent on one another. You can assume that what the function does is: listA[i]*listB[j]*listC[k]

After searching a lot online, i couldn't find any solution. Much appreciate.


Solution

  • The snippet can be improved before using dask. Instead of iterating over index and then looking up the corresponding item in a list, one could iterate over the list directly (i.e. use for item in list_A:). Since in this case we are interested in all combinations of items in three lists, we can make use of the built-in combinations:

    from itertools import combinations
    triples = combinations(list_A, list_B, list_C)
    for i,j,k in triples:
         function(i,j,k)
    

    To use dask one option is to use the delayed API. By wrapping function with dask.delayed, we obtain an immediate lazy reference to the results of the function. After collecting all the lazy references we can compute them in parallel with dask.compute:

    import dask
    from itertools import combinations
    
    triples = combinations(list_A, list_B, list_C)
    delayeds = [dask.delayed(function)(i,j,k)for i,j,k in triples]
    results = dask.compute(*delayeds)