pythonpandasnumpygroup-by

Calculate weighted average using Pandas groupby and deal with NaN


I'm trying to calculate the weighted average using groupby. However, I'm dealing with empty value in my dataframe.

df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'a'],
    'x': [10, 20, np.nan, 20],
    'weight_x': [10, 15, np.nan, 25],
    'y': [25, 35, 45, np.nan],
    'weight_y': [10, 20, 10, np.nan]
})

This is my groupby function:

summary = (
    df
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.average(x['x'], weights=x['weight_x']),
            np.average(x['y'], weights=x['weight_y'])
        ], index=['wt_avg_x', 'wt_avg_y'])
    )
    .reset_index()
)

This one gives the following output:

  group  wt_avg_x  wt_avg_y
0     a       NaN       NaN

However, expected output should be the following:

  group  wt_avg_x  wt_avg_y
0     a      20.5        35

I've tried this solution:

summary = (
    df.dropna(subset=['x', 'y', 'weight_x', 'weight_y'])
    .groupby(['group'])
    .apply(
        lambda x: pd.Series([
            np.average(x['x'], weights=x['weight_x']),
            np.average(x['y'], weights=x['weight_y'])
        ], index=['wt_avg_x', 'wt_avg_y'])
    )
    .reset_index()
)

But dropna drops the whole line. How could I ignore nan values only in the np.average function?


Solution

  • You could also use the following:

    def nan_average(x, cols = ['x', 'y']):
        vals = x[cols]
        weights = x[[f'weight_{c}' for c in cols]].to_numpy()
        return np.sum(vals*weights,0)/weights.sum(0,where=~np.isnan(vals.to_numpy()))
    
    df.groupby('group').apply(nan_average).reset_index()
      group     x     y
    0     a  18.0  35.0
    

    Another way:

    a = df.rename(columns = {'x':'val_x', 'y':'val_y'}).reset_index()
    
    (pd.wide_to_long(a, ['val','weight'], ['index', 'group'], '', '_', '\\w+')
        .groupby(level=[1,2]).apply(lambda x: np.nansum(x.val * x.weight)/
                x.weight.to_numpy().sum(where=~np.isnan(x.val.to_numpy())))
        .unstack().reset_index())
    
      group     x     y
    0     a  18.0  35.0