python-polars

Apply function accross multiple columns within group_by in Polars


Given this dataframe:

polars_df = pl.DataFrame({
    "name": ["A","B","C"],
    "group": ["a","a","b"],
    "val1": [1, None, 3],
    "val2": [1, 5, None],
    "val3": [None, None, 3],
})

I want to calculate the mean and count the number of NAs within the three val* columns for each group. So the result should look like:

pl.DataFrame([
    {'group': 'a', 'mean': 2.0, 'percentage_na': 0.5},
    {'group': 'b', 'mean': 3.0, 'percentage_na': 0.3333333333333333}
])

In Pandas I was able to do this with this (quite ugly and not optimized) code:

df = polars_df.to_pandas()

pd.concat([
    df.groupby(["group"]).apply(lambda g: g.filter(like="val").mean().mean()).rename("mean"),
    df.groupby(["group"]).apply(lambda g: g.filter(like="val").isna().sum().sum() / (g.filter(like="val").shape[0] * g.filter(like="val").shape[1])).rename("percentage_na")
], axis=1)

Solution

  • I rolled back my answer to when the answer is 2.33

    all_cols_except_val=[x for x in df.columns if "val" not in x]
    df.unpivot(index=all_cols_except_val) \
        .group_by('group') \
        .agg(
            mean=pl.col('value').mean(),
            percent_na=pl.col('value').is_null().sum()/pl.col('value').count()
        )
    
    
    shape: (2, 3)
    ┌───────┬──────────┬────────────┐
    │ group ┆ mean     ┆ percent_na │
    │ ---   ┆ ---      ┆ ---        │
    │ str   ┆ f64      ┆ f64        │
    ╞═══════╪══════════╪════════════╡
    │ b     ┆ 3.0      ┆ 0.333333   │
    │ a     ┆ 2.333333 ┆ 0.5        │
    └───────┴──────────┴────────────┘