I want to calculate the weighted average of a polars dataframe during a groupby operation. If the sum of the weights within the groupby equal zero I want to return the equally weighted mean. Additionally, I want to ignore null values in the columns to average or the weight columns.
Notation:
df (polars.DataFrame): input dataframe
gr (list[str]): columns to groupby
vr (list[str]): columns to take average of
wt (str): column to use as the weights
To calculate the equally weighted average I have
res = df.group_by(gr, maintain_order = True).agg(polars.col(vr).mean())
By default mean function ignores null values in the 'vr' columns, which is desired.
To calculate the weighted average I have
mask_wt = polars.col(wt) * polars.col(vr).is_not_null()
wavg = (polars.col(vr) * polars.col(wt)).sum() / mask_wt.sum()
res = df.group_by(gr, maintain_order = True).agg(wavg)
Masking the weights allows me to ignore null values in 'vr' columns and the 'wt' column.
For example,
import polars
df = polars.DataFrame({'id': ['x', 'x', 'y', 'y'],
'dt': ['j', 'f', 'j', 'f'],
'a': [None, 2, 3, 4],
'b': [5, 6, 7, 8],
'wt': [1, 0, 0.25, 0.25]})
gr = ['id']
vr = ['a', 'b']
wt = 'wt'
>>> df
┌─────┬─────┬──────┬─────┬──────┐
│ id ┆ dt ┆ a ┆ b ┆ wt │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ i64 ┆ f64 │
╞═════╪═════╪══════╪═════╪══════╡
│ x ┆ j ┆ null ┆ 5 ┆ 1.0 │
│ x ┆ f ┆ 2 ┆ 6 ┆ 0.0 │
│ y ┆ j ┆ 3 ┆ 7 ┆ 0.25 │
│ y ┆ f ┆ 4 ┆ 8 ┆ 0.25 │
└─────┴─────┴──────┴─────┴──────┘
>>> df.group_by(gr, maintain_order = True).agg(polars.col(vr).mean())
┌─────┬─────┬─────┐
│ id ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ x ┆ 2.0 ┆ 5.5 │
│ y ┆ 3.5 ┆ 7.5 │
└─────┴─────┴─────┘
>>> mask_wt = polars.col(wt) * polars.col(vr).is_not_null()
>>> wavg = (polars.col(vr) * polars.col(wt)).sum() / mask_wt.sum()
>>> res = df.group_by(gr, maintain_order = True).agg(wavg)
┌─────┬─────┬─────┐
│ id ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ x ┆ NaN ┆ 5.0 │
│ y ┆ 3.5 ┆ 7.5 │
└─────┴─────┴─────┘
I would like the NaN in the weighted average to be replaced with the 2 from the equally weighted average because when ignoring the weight for the null value in the column 'a' the sum of the weights is 0, so the equally weighted average should be returned.
Using pandas I can accomplish this by
import pandas
def _wavg_py(gr, vr, wt):
x = gr[[vr, wt]].dropna()
den = x[wt].sum()
if(den == 0):
return(gr[vr].mean())
else:
return((x[vr] * x[wt]).sum() / den)
res = df.groupby(by = gr).apply(_wavg_py, col, wt)
I'm not sure if it is an acceptable approach to
.fill_nan()
with the .mean()
df.group_by(gr, maintain_order=True).agg(
wavg.fill_nan(pl.col(vr).mean())
)
shape: (2, 3)
┌─────┬─────┬─────┐
│ id ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ x ┆ 2.0 ┆ 5.0 │
│ y ┆ 3.5 ┆ 7.5 │
└─────┴─────┴─────┘
Alternatively, it seems you're asking how to perform conditional logic inside expressions
i.e.
when/then/otherwise
df.group_by(gr, maintain_order=True).agg(
pl.when(pl.col('wt').filter(pl.col(vr).is_not_null()).sum() == 0)
.then(pl.col(vr).mean())
.otherwise(wavg)
)
shape: (2, 3)
┌─────┬─────┬─────┐
│ id ┆ a ┆ b │
│ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 │
╞═════╪═════╪═════╡
│ x ┆ 2.0 ┆ 5.0 │
│ y ┆ 3.5 ┆ 7.5 │
└─────┴─────┴─────┘