Here is a script to simulate the issue I am facing:
import pandas as pd
import numpy as np
data = {
    'a':[1,2,1,1,2,1,1],
    'b':[10,40,20,10,40,10,20],
    'c':[0.3, 0.2, 0.6, 0.4, 0.5, 0.2, 0.8],
    'd':[3, 1, 5, 1, 7, 2., 2.],
}
df = pd.DataFrame.from_dict(data)
# I apply some custom function to populate column 'e'.
# For demonstration, I am using a very simple function here.
df['e']=df.apply(lambda x: x['c']<=0.3, axis=1)
# This is the column I need to obtain using groupby and pipe/transform
df['f']=[2., 1., 0., 2., 1., 2., 0.]
print(df)
Output:
a b c d e f
0 1 10 0.3 3.0 True 2.0
1 2 40 0.2 1.0 True 1.0
2 1 20 0.6 5.0 False 0.0
3 1 10 0.4 1.0 False 2.0
4 2 40 0.5 7.0 False 1.0
5 1 10 0.2 2.0 True 2.0
6 1 20 0.8 2.0 False 0.0
The logic to be used to find column f is as follows. For each group of df.groupby(['a', 'b']): find the rows where e is True, take the row with the minimum d among them, and return that row's d; if no row in the group has e set to True, return 0. For example, in the group (a=1, b=10), e is True for rows 0 and 5 with d values 3.0 and 2.0, so f is 2.0 for every row of that group. (In the real application, d needs to be manipulated in conjunction with other columns, and that result would be returned.)
What I have tried:
def func(x):
    print(type(x))
    print(x)
    print('-'*50)
    ind=np.where(x['e']) #<--- How can I implement this?
    if len(ind)>0:
        ind_min=np.argmin(x.iloc[ind]['d'])
        return x.iloc[ind[ind_min]]['d']
    else:
        return 0
df['g']=df.groupby(['a', 'b']).pipe(func)
Output:
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B348735550>
--------------------------------------------------
...
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (3, 2) + inhomogeneous part.
I get the mentioned error on the line ind=np.where(x['e']).
So, how can I apply np.where on a pandas.core.groupby.generic.DataFrameGroupBy object?
You're presenting the XY problem. Here's one approach:
cond = df['c'] <= 0.3
df['f'] = (
    df.assign(filtered_d=df['d'].where(cond))
      .groupby(['a', 'b'])['filtered_d']
      .transform('min')
      .fillna(0)
)
Output:
a b c d e f
0 1 10 0.3 3.0 True 2.0
1 2 40 0.2 1.0 True 1.0
2 1 20 0.6 5.0 False 0.0
3 1 10 0.4 1.0 False 2.0
4 2 40 0.5 7.0 False 1.0
5 1 10 0.2 2.0 True 2.0
6 1 20 0.8 2.0 False 0.0
Explanation / Intermediates
Apply Series.where to column 'd' based on the boolean series cond, to keep only the values that we want to consider for the min:
# df['d'].where(cond)
0 3.0
1 1.0
2 NaN
3 NaN
4 NaN
5 2.0
6 NaN
Name: d, dtype: float64
Use df.assign to include the result (here as filtered_d), then apply df.groupby + groupby.transform with 'min' to get the per-group minimum broadcast back to every row.
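On the sample data, that intermediate transform result would look roughly like this (NaN marks the groups that have no row with e set to True; the exact Name/dtype footer may vary with your pandas version):
# df.assign(filtered_d=df['d'].where(cond)).groupby(['a', 'b'])['filtered_d'].transform('min')
0    2.0
1    1.0
2    NaN
3    2.0
4    1.0
5    2.0
6    NaN
Name: filtered_d, dtype: float64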
Finally, apply Series.fillna with 0 and assign the result as the new column. An alternative way to do this could be:
cond = df['c'] <= 0.3
df['f'] = (
    df.merge(
        df[cond]
        .groupby(['a', 'b'], as_index=False)
        .agg(f=('d', 'min')),
        on=['a', 'b'],
        how='left'
    )
    .assign(f=lambda x: x['f'].fillna(0))
    ['f']
)
Explanation
Filter df with cond, then apply df.groupby + groupby.agg with named aggregation to compute the per-group minimum of 'd' as column 'f':
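On the sample data, that aggregation would produce roughly:
# df[cond].groupby(['a', 'b'], as_index=False).agg(f=('d', 'min'))
   a   b    f
0  1  10  2.0
1  2  40  1.0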
Merge the result back onto df via df.merge with how='left', chain df.assign to apply Series.fillna with 0, and select column 'f' to assign it back.
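As for the literal question: you don't apply np.where to the DataFrameGroupBy object itself. If you really need arbitrary per-group logic, pass a function to groupby.apply so it receives one sub-DataFrame per group, where ordinary boolean masking works. A minimal sketch along the lines of your func (the helper name pick_min_d and result column 'g' are just for illustration):
def pick_min_d(g):
    # g is the sub-DataFrame for one (a, b) group
    matched = g.loc[g['e'], 'd']  # d values of the rows where e is True
    return matched.min() if not matched.empty else 0.0

per_group = (
    df.groupby(['a', 'b'])
      .apply(pick_min_d)   # Series indexed by (a, b)
      .rename('g')
      .reset_index()
)
df = df.merge(per_group, on=['a', 'b'], how='left')
This reproduces column f (as g) on the sample data, but it will generally be slower than the vectorized where/transform approach above, so prefer that whenever the per-group logic can be expressed with built-in aggregations.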