[SOLVED] dask delayed functions on pandas groupby objects

I couldn't figure out how to compute the delayed objects that come from a df.groupby(...).apply() operation. I would really appreciate it if someone could help. Here is the sample code I wrote:

import pandas as pd
import dask
df = pd.DataFrame(columns=['id','id2','val1'])
df['id'] = ['A','A','A','B','C','C','D','D']
df['id2']=['a','a','b','a','a','b','b','b']
df['val1']= [1,2,3,4,5,6,7,8]
@dask.delayed
def dask_test(group,val_col):
    for idx,row in group.iterrows():
        group.loc[idx,'test']=2*group.loc[idx,val_col]
    return group

tmp_grp = df.groupby(['id','id2']).apply(dask_test,'val1')

The output of tmp_grp is

id  id2
A   a      Delayed('copy-f0e26845-fc3a-4bb7-8609-47b923c0...
    b      Delayed('copy-9b6cecf5-9fa4-4301-ba2d-dec5478d...
B   a      Delayed('copy-7b538f4b-ac3f-4c83-b37b-e620d0ba...
C   a      Delayed('copy-c722fa78-c46e-422a-88a5-b9e48cac...
    b      Delayed('copy-01454a03-fd28-4fa5-b487-563ccc66...
D   b      Delayed('copy-f6cf94bd-d457-4495-bb2e-1db0152c...
dtype: object

I don't know how to call delayed objects from this and compute them.

Thank you so much in advance.

Solution

When working with dask.delayed, it is better to construct the list of delayed values explicitly rather than going through groupby.apply. In your case this would be:

delayeds = [dask_test(group, 'val1') for _, group in df.groupby(['id', 'id2'])]

Then, the delayed values can be computed using dask.compute(*delayeds).
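
For reference, here is a minimal end-to-end sketch of the approach above, using the sample data from the question. Two things in it are my own assumptions rather than part of the original code: the body of dask_test is replaced with a vectorized column assignment on a copy of the group (instead of the iterrows loop), and the per-group results are stitched back together with pd.concat. The names results and combined are just illustrative.

```python
import dask
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "C", "C", "D", "D"],
    "id2": ["a", "a", "b", "a", "a", "b", "b", "b"],
    "val1": [1, 2, 3, 4, 5, 6, 7, 8],
})

@dask.delayed
def dask_test(group, val_col):
    # Assumption: a vectorized stand-in for the original iterrows loop.
    # Work on a copy so the input group is not modified in place.
    out = group.copy()
    out["test"] = 2 * out[val_col]
    return out

# Build one delayed task per group explicitly
delayeds = [dask_test(group, "val1") for _, group in df.groupby(["id", "id2"])]

# dask.compute returns a tuple with one result per delayed value
results = dask.compute(*delayeds)

# Assumption: combine the per-group DataFrames into a single DataFrame
combined = pd.concat(results)
print(combined)
```

If you only need the per-group DataFrames, you can stop at results and skip the concat step.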