Tags: python, dataframe, group-by, aggregate, python-polars

Multiple aggregations on multiple columns in Python polars


While checking out how to implement binning with Python polars, I found I can easily calculate aggregates for individual columns:

import polars as pl
import numpy as np

t, v = np.arange(0, 100, 2), np.arange(0, 100, 2)
df = pl.DataFrame({"t": t, "v0": v, "v1": v})
df = df.with_columns((pl.datetime(2022,10,30) + pl.duration(seconds=df["t"])).alias("datetime")).drop("t")

df.group_by_dynamic("datetime", every="10s").agg(pl.col("v0").mean())
shape: (10, 2)
┌─────────────────────┬──────┐
│ datetime            ┆ v0   │
│ ---                 ┆ ---  │
│ datetime[μs]        ┆ f64  │
╞═════════════════════╪══════╡
│ 2022-10-30 00:00:00 ┆ 4.0  │
│ 2022-10-30 00:00:10 ┆ 14.0 │
│ 2022-10-30 00:00:20 ┆ 24.0 │
│ 2022-10-30 00:00:30 ┆ 34.0 │
│ ...                 ┆ ...  │

or calculate multiple aggregations like

df.group_by_dynamic("datetime", every="10s").agg(
    pl.col("v0").mean().alias("v0_binmean"),
    pl.col("v0").count().alias("v0_bincount")
)
┌─────────────────────┬────────────┬─────────────┐
│ datetime            ┆ v0_binmean ┆ v0_bincount │
│ ---                 ┆ ---        ┆ ---         │
│ datetime[μs]        ┆ f64        ┆ u32         │
╞═════════════════════╪════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 5           │
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 5           │
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 5           │
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 5           │
│ ...                 ┆ ...        ┆ ...         │

or calculate one aggregation for multiple columns like

cols = [c for c in df.columns if "datetime" not in c]
df.group_by_dynamic("datetime", every="10s").agg(
    pl.col(c).mean().alias(f"{c}_binmean")
    for c in cols
)

┌─────────────────────┬────────────┬────────────┐
│ datetime            ┆ v0_binmean ┆ v1_binmean │
│ ---                 ┆ ---        ┆ ---        │
│ datetime[μs]        ┆ f64        ┆ f64        │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        │
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       │
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       │
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 34.0       │
│ ...                 ┆ ...        ┆ ...        │

However, combining both approaches fails!

df.group_by_dynamic("datetime", every="10s").agg(
    [
        pl.col(c).mean().alias(f"{c}_binmean"),
        pl.col(c).count().alias(f"{c}_bincount"),
    ]
    for c in cols
)
DuplicateError: column with name 'literal' has more than one occurrences

Is there a "polarustic" approach to calculate multiple statistical parameters for multiple (all) columns of the dataframe in one go?

Related (pandas-specific): Python pandas groupby aggregate on multiple columns


Solution

  • There are various ways of selecting multiple columns "at once" in polars:

    df.select(pl.all()).columns
    # ['v0', 'v1', 'datetime']
    
    df.select(pl.col("v0", "v1")).columns       # by name(s)
    # ['v0', 'v1']
    
    df.select(pl.exclude("datetime")).columns   # by exclusion
    # ['v0', 'v1']
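
    Recent polars versions also ship a polars.selectors module for selecting columns by dtype or name pattern. A small sketch with a throwaway frame, assuming that module is available in your version:

    ```python
    import polars as pl
    import polars.selectors as cs

    # hypothetical frame mixing numeric and string columns
    df = pl.DataFrame({"v0": [1.0, 2.0], "v1": [3.0, 4.0], "label": ["a", "b"]})

    # select every numeric column, regardless of its name
    print(df.select(cs.numeric()).columns)  # ['v0', 'v1']
    ```

    This avoids having to maintain an explicit exclusion list when non-numeric columns are added later.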
    

    The output column names can be controlled using the .name.* methods, e.g. .name.suffix():

    df.select(pl.exclude("datetime").mean().name.suffix("_binmean"))
    
    shape: (1, 2)
    ┌────────────┬────────────┐
    │ v0_binmean ┆ v1_binmean │
    │ ---        ┆ ---        │
    │ f64        ┆ f64        │
    ╞════════════╪════════════╡
    │ 49.0       ┆ 49.0       │
    └────────────┴────────────┘
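
    The same namespace offers .name.prefix() if you prefer the label in front; a minimal sketch on a throwaway two-column frame:

    ```python
    import polars as pl

    # hypothetical frame just to demonstrate .name.prefix()
    df = pl.DataFrame({"v0": [2.0, 4.0], "v1": [6.0, 8.0]})

    # prepend instead of append to the output column names
    out = df.select(pl.all().mean().name.prefix("mean_"))
    print(out.columns)  # ['mean_v0', 'mean_v1']
    ```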
    

    As such, we can rewrite your example using:

    df.group_by_dynamic("datetime", every="10s").agg(
       pl.exclude("datetime").mean().name.suffix("_binmean"),
       pl.exclude("datetime").count().name.suffix("_bincount")
    )
    
    shape: (10, 5)
    ┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
    │ datetime            ┆ v0_binmean ┆ v1_binmean ┆ v0_bincount ┆ v1_bincount │
    │ ---                 ┆ ---        ┆ ---        ┆ ---         ┆ ---         │
    │ datetime[μs]        ┆ f64        ┆ f64        ┆ u32         ┆ u32         │
    ╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
    │ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        ┆ 5           ┆ 5           │
    │ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       ┆ 5           ┆ 5           │
    │ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       ┆ 5           ┆ 5           │
    │ 2022-10-30 00:00:30 ┆ 34.0       ┆ 34.0       ┆ 5           ┆ 5           │
    │ 2022-10-30 00:00:40 ┆ 44.0       ┆ 44.0       ┆ 5           ┆ 5           │
    │ 2022-10-30 00:00:50 ┆ 54.0       ┆ 54.0       ┆ 5           ┆ 5           │
    │ 2022-10-30 00:01:00 ┆ 64.0       ┆ 64.0       ┆ 5           ┆ 5           │
    │ 2022-10-30 00:01:10 ┆ 74.0       ┆ 74.0       ┆ 5           ┆ 5           │
    │ 2022-10-30 00:01:20 ┆ 84.0       ┆ 84.0       ┆ 5           ┆ 5           │
    │ 2022-10-30 00:01:30 ┆ 94.0       ┆ 94.0       ┆ 5           ┆ 5           │
    └─────────────────────┴────────────┴────────────┴─────────────┴─────────────┘
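
    For completeness: the per-column comprehension from the question can also be made to work by flattening the nested lists into a single sequence of expressions before passing it to .agg() — the DuplicateError apparently occurs because .agg() receives a generator of lists, and each inner list is parsed as one literal expression rather than as individual aggregations. A sketch reproducing the question's data:

    ```python
    import numpy as np
    import polars as pl

    # rebuild the question's frame: 50 samples, one every 2 seconds
    t = np.arange(0, 100, 2)
    df = (
        pl.DataFrame({"t": t, "v0": t, "v1": t})
        .with_columns(
            (pl.datetime(2022, 10, 30) + pl.duration(seconds=pl.col("t"))).alias("datetime")
        )
        .drop("t")
    )

    cols = [c for c in df.columns if c != "datetime"]

    # flatten the (mean, count) pairs into one flat list of expressions
    aggs = [
        expr
        for c in cols
        for expr in (
            pl.col(c).mean().alias(f"{c}_binmean"),
            pl.col(c).count().alias(f"{c}_bincount"),
        )
    ]
    out = df.group_by_dynamic("datetime", every="10s").agg(aggs)
    print(out.columns)
    # ['datetime', 'v0_binmean', 'v0_bincount', 'v1_binmean', 'v1_bincount']
    ```

    Note the column order differs from the pl.exclude() version: here the mean/count pairs are grouped per source column, which may or may not be what you want.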