I'm trying to calculate the weighted average using groupby. However, I'm dealing with empty value in my dataframe.
df = pd.DataFrame({
'group': ['a', 'a', 'a', 'a'],
'x': [10, 20, np.nan, 20],
'weight_x': [10, 15, np.nan, 25],
'y': [25, 35, 45, np.nan],
'weight_y': [10, 20, 10, np.nan]
})
This is my groupby function:
summary = (
df
.groupby(['group'])
.apply(
lambda x: pd.Series([
np.average(x['x'], weights=x['weight_x']),
np.average(x['y'], weights=x['weight_y'])
], index=['wt_avg_x', 'wt_avg_y'])
)
.reset_index()
)
This one gives the following output:
group wt_avg_x wt_avg_y
0 a NaN NaN
However, expected output should be the following:
group wt_avg_x wt_avg_y
0 a 20.5 35
I've tried this solution:
summary = (
df.dropna(subset=['x', 'y', 'weight_x', 'weight_y'])
.groupby(['group'])
.apply(
lambda x: pd.Series([
np.average(x['x'], weights=x['weight_x']),
np.average(x['y'], weights=x['weight_y'])
], index=['wt_avg_x', 'wt_avg_y'])
)
.reset_index()
)
But dropna
drops the whole line. How could I ignore nan
values only in the np.average
function?
You could also use the following:
def nan_average(x, cols = ['x', 'y']):
vals = x[cols]
weights = x[[f'weight_{c}' for c in cols]].to_numpy()
return np.sum(vals*weights,0)/weights.sum(0,where=~np.isnan(vals.to_numpy()))
df.groupby('group').apply(nan_average).reset_index()
group x y
0 a 18.0 35.0
Another way:
a = df.rename(columns = {'x':'val_x', 'y':'val_y'}).reset_index()
(pd.wide_to_long(a, ['val','weight'], ['index', 'group'], '', '_', '\\w+')
.groupby(level=[1,2]).apply(lambda x: np.nansum(x.val * x.weight)/
x.weight.to_numpy().sum(where=~np.isnan(x.val.to_numpy())))
.unstack().reset_index())
group x y
0 a 18.0 35.0