pandas-groupby dask dask-delayed

dask delayed functions on pandas groupby objects


I couldn't figure out how to compute the delayed objects returned by a df.groupby(...).apply() operation. I would really appreciate it if someone could help. Here is the sample code I wrote:

import pandas as pd
import dask
df = pd.DataFrame(columns=['id','id2','val1'])
df['id'] = ['A','A','A','B','C','C','D','D']
df['id2']=['a','a','b','a','a','b','b','b']
df['val1']= [1,2,3,4,5,6,7,8]
@dask.delayed
def dask_test(group,val_col):
    for idx,row in group.iterrows():
        group.loc[idx,'test']=2*group.loc[idx,val_col]
    return group

tmp_grp = df.groupby(['id','id2']).apply(dask_test,'val1')

The output of tmp_grp is

id  id2
A   a      Delayed('copy-f0e26845-fc3a-4bb7-8609-47b923c0...
    b      Delayed('copy-9b6cecf5-9fa4-4301-ba2d-dec5478d...
B   a      Delayed('copy-7b538f4b-ac3f-4c83-b37b-e620d0ba...
C   a      Delayed('copy-c722fa78-c46e-422a-88a5-b9e48cac...
    b      Delayed('copy-01454a03-fd28-4fa5-b487-563ccc66...
D   b      Delayed('copy-f6cf94bd-d457-4495-bb2e-1db0152c...
dtype: object

I don't know how to get the delayed objects out of this result and compute them.

Thank you so much in advance.


Solution

  • When working with delayed, it is better to construct the list of delayed values explicitly. In your case this would be:

    delayeds=[dask_test(group, 'val1') for _, group in df.groupby(['id', 'id2'])]
    

    Then the delayed values can be computed with dask.compute(*delayeds), which returns a tuple containing one result per group.
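
For reference, here is a minimal end-to-end sketch of that approach using the sample data from the question. Two things in it are my own assumptions rather than part of the original code: the function works on a copy of each group and uses a vectorized assignment instead of the iterrows loop, and the per-group results are stitched back together with pd.concat. Treat it as one possible way to finish the computation, not the only one.

```python
import dask
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "A", "A", "B", "C", "C", "D", "D"],
    "id2": ["a", "a", "b", "a", "a", "b", "b", "b"],
    "val1": [1, 2, 3, 4, 5, 6, 7, 8],
})

@dask.delayed
def dask_test(group, val_col):
    # Assumption: work on a copy so the original frame is never modified.
    group = group.copy()
    # Vectorized equivalent of the row-by-row loop in the question.
    group["test"] = 2 * group[val_col]
    return group

# Build one delayed task per group; nothing is executed yet.
delayeds = [dask_test(group, "val1") for _, group in df.groupby(["id", "id2"])]

# Run all tasks; dask.compute returns a tuple of DataFrames.
results = dask.compute(*delayeds)

# Assumption: recombine the groups and restore the original row order.
result = pd.concat(results).sort_index()
print(result)
```

The groupby loop yields the sub-frames directly, so each delayed call receives a plain DataFrame and there is no Series of Delayed objects to unpack afterwards.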