[SOLVED] Calculate weighted average using Pandas groupby and deal with NaN

Calculate weighted average using Pandas groupby and deal with NaN

I'm trying to calculate the weighted average using groupby. However, I'm dealing with empty value in my dataframe.

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a'],
    'x': [10, 20, np.nan, 20],
    'weight_x': [10, 15, np.nan, 25],
    'y': [25, 35, 45, np.nan],
    'weight_y': [10, 20, 10, np.nan]
})

This is my groupby function:

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.average(x['x'], weights=x['weight_x']),
            np.average(x['y'], weights=x['weight_y'])
        ], index=['wt_avg_x', 'wt_avg_y'])
    )
    .reset_index()
)

This one gives the following output:

  group  wt_avg_x  wt_avg_y
0     a       NaN       NaN

However, expected output should be the following:

  group  wt_avg_x  wt_avg_y
0     a      20.5        35

I've tried this solution:

summary = (
    df.dropna(subset=['x', 'y', 'weight_x', 'weight_y'])
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.average(x['x'], weights=x['weight_x']),
            np.average(x['y'], weights=x['weight_y'])
        ], index=['wt_avg_x', 'wt_avg_y'])
    )
    .reset_index()
)

But dropna drops the whole line. How could I ignore nan values only in the np.average function?

Solution

You could also use the following:

def nan_average(x, cols = ['x', 'y']):
    vals = x[cols]
    weights = x[[f'weight_{c}' for c in cols]].to_numpy()
    return np.sum(vals*weights,0)/weights.sum(0,where=~np.isnan(vals.to_numpy()))

df.groupby('group').apply(nan_average).reset_index()
  group     x     y
0     a  18.0  35.0

Another way:

a = df.rename(columns = {'x':'val_x', 'y':'val_y'}).reset_index()

(pd.wide_to_long(a, ['val','weight'], ['index', 'group'], '', '_', '\\w+')
    .groupby(level=[1,2]).apply(lambda x: np.nansum(x.val * x.weight)/
            x.weight.to_numpy().sum(where=~np.isnan(x.val.to_numpy())))
    .unstack().reset_index())

  group     x     y
0     a  18.0  35.0