I couldn't figure out how to compute the delayed objects returned by a df.groupby.apply()
operation. I would really appreciate it if someone could help. Here is some sample code I wrote:
import pandas as pd
import dask
df = pd.DataFrame(columns=['id','id2','val1'])
df['id'] = ['A','A','A','B','C','C','D','D']
df['id2']=['a','a','b','a','a','b','b','b']
df['val1']= [1,2,3,4,5,6,7,8]
@dask.delayed
def dask_test(group,val_col):
    for idx,row in group.iterrows():
        group.loc[idx,'test']=2*group.loc[idx,val_col]
    return group
tmp_grp = df.groupby(['id','id2']).apply(dask_test,'val1')
The output of tmp_grp is
id  id2
A   a      Delayed('copy-f0e26845-fc3a-4bb7-8609-47b923c0...
    b      Delayed('copy-9b6cecf5-9fa4-4301-ba2d-dec5478d...
B   a      Delayed('copy-7b538f4b-ac3f-4c83-b37b-e620d0ba...
C   a      Delayed('copy-c722fa78-c46e-422a-88a5-b9e48cac...
    b      Delayed('copy-01454a03-fd28-4fa5-b487-563ccc66...
D   b      Delayed('copy-f6cf94bd-d457-4495-bb2e-1db0152c...
dtype: object
I don't know how to call delayed objects from this and compute them.
Thank you so much in advance.
When working with delayed, it's better to construct the list of delayed values explicitly. In your context this would be:
delayeds=[dask_test(group, 'val1') for _, group in df.groupby(['id', 'id2'])]
Then the delayed values can be computed using dask.compute(*delayeds), which returns a tuple with one result per group.
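Putting it together, here is a minimal end-to-end sketch of this approach using the sample data from the question. The function name double_column and the variable names results and combined are my own placeholders, not anything from dask. I have also replaced the row-by-row loop with a vectorized assignment on a copy of each group, which is an assumption about what you want rather than a requirement of delayed:

```python
import dask
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    "id":   ["A", "A", "A", "B", "C", "C", "D", "D"],
    "id2":  ["a", "a", "b", "a", "a", "b", "b", "b"],
    "val1": [1, 2, 3, 4, 5, 6, 7, 8],
})

@dask.delayed
def double_column(group, val_col):
    # Hypothetical stand-in for dask_test: work on a copy so the
    # input group is not mutated, and double the column in one step.
    out = group.copy()
    out["test"] = 2 * out[val_col]
    return out

# Build one delayed task per group; nothing is computed yet.
delayeds = [double_column(group, "val1") for _, group in df.groupby(["id", "id2"])]

# dask.compute returns a tuple with one DataFrame per group.
results = dask.compute(*delayeds)

# Stitch the per-group results back into a single DataFrame.
combined = pd.concat(list(results)).sort_index()
print(combined)
```

Nothing runs until dask.compute is called, so building the list is cheap. Whether this is actually faster than a plain groupby depends on how expensive the per-group function is, since each task carries some scheduling overhead.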