What's the best way to apply a custom function to multiple columns in Polars? Specifically, I need the function to reference another column in the dataframe. Say I have the following:
import polars as pl

df = pl.DataFrame({
    'group': [1, 1, 2, 2],
    'other': ['a', 'b', 'a', 'b'],
    'num_obs': [10, 5, 20, 10],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
})
And I want to group by group and calculate an average of x and y, weighted by num_obs. I can do something like this:
variables = ['x', 'y']
df.group_by('group').agg((pl.col(var) * pl.col('num_obs')).sum()/pl.col('num_obs').sum() for var in variables)
but I'm wondering if there's a better way. Also, I don't know how to add other aggregations to this approach. Is there a way that I could also add pl.sum('num_obs')?
You can just pass several column names (or a list of them) to pl.col():
df.group_by('group').agg(
    (pl.col('x', 'y') * pl.col('num_obs')).sum() / pl.col('num_obs').sum(),
    pl.col('num_obs').sum()
)
┌───────┬──────────┬──────────┬─────────┐
│ group ┆ x        ┆ y        ┆ num_obs │
│ ---   ┆ ---      ┆ ---      ┆ ---     │
│ i64   ┆ f64      ┆ f64      ┆ i64     │
╞═══════╪══════════╪══════════╪═════════╡
│ 1     ┆ 1.333333 ┆ 5.333333 ┆ 15      │
│ 2     ┆ 3.333333 ┆ 7.333333 ┆ 30      │
└───────┴──────────┴──────────┴─────────┘