
Polars apply same custom function to multiple columns in group by


What's the best way to apply a custom function to multiple columns in Polars? Specifically I need the function to reference another column in the dataframe. Say I have the following:

import polars as pl

df = pl.DataFrame({
    'group': [1,1,2,2],
    'other': ['a', 'b', 'a', 'b'],
    'num_obs': [10, 5, 20, 10],
    'x': [1,2,3,4],
    'y': [5,6,7,8],
})

I want to group by the group column and calculate the average of x and y, weighted by num_obs. I can do something like this:

variables = ['x', 'y']
df.group_by('group').agg(
    (pl.col(var) * pl.col('num_obs')).sum() / pl.col('num_obs').sum()
    for var in variables
)

but I'm wondering if there's a better way. I also don't know how to add other aggregations to this approach. For example, is there a way to also include pl.sum('num_obs') in the same agg call?


Solution

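  • You can pass several column names to pl.col(); the expression is then expanded and applied to each of those columns. Any additional aggregations go in as further arguments to agg():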

    df.group_by('group').agg(
        (pl.col('x','y') * pl.col('num_obs')).sum() / pl.col('num_obs').sum(),
        pl.col('num_obs').sum()
    )
    
    ┌───────┬──────────┬──────────┬─────────┐
    │ group ┆ x        ┆ y        ┆ num_obs │
    │ ---   ┆ ---      ┆ ---      ┆ ---     │
    │ i64   ┆ f64      ┆ f64      ┆ i64     │
    ╞═══════╪══════════╪══════════╪═════════╡
    │ 1     ┆ 1.333333 ┆ 5.333333 ┆ 15      │
    │ 2     ┆ 3.333333 ┆ 7.333333 ┆ 30      │
    └───────┴──────────┴──────────┴─────────┘