While working out how to implement binning with Python polars, I found that I can easily calculate aggregates for individual columns:
import polars as pl
import numpy as np

t, v = np.arange(0, 100, 2), np.arange(0, 100, 2)
df = pl.DataFrame({"t": t, "v0": v, "v1": v})
df = df.with_columns(
    (pl.datetime(2022, 10, 30) + pl.duration(seconds=pl.col("t"))).alias("datetime")
).drop("t")
df.group_by_dynamic("datetime", every="10s").agg(pl.col("v0").mean())
shape: (10, 2)
┌─────────────────────┬──────┐
│ datetime            ┆ v0   │
│ ---                 ┆ ---  │
│ datetime[μs]        ┆ f64  │
╞═════════════════════╪══════╡
│ 2022-10-30 00:00:00 ┆ 4.0  │
│ 2022-10-30 00:00:10 ┆ 14.0 │
│ 2022-10-30 00:00:20 ┆ 24.0 │
│ 2022-10-30 00:00:30 ┆ 34.0 │
│ …                   ┆ …    │
└─────────────────────┴──────┘
or calculate multiple aggregations of one column:
df.group_by_dynamic("datetime", every="10s").agg(
    pl.col("v0").mean().alias("v0_binmean"),
    pl.col("v0").count().alias("v0_bincount")
)
shape: (10, 3)
┌─────────────────────┬────────────┬─────────────┐
│ datetime            ┆ v0_binmean ┆ v0_bincount │
│ ---                 ┆ ---        ┆ ---         │
│ datetime[μs]        ┆ f64        ┆ u32         │
╞═════════════════════╪════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 5           │
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 5           │
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 5           │
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 5           │
│ …                   ┆ …          ┆ …           │
└─────────────────────┴────────────┴─────────────┘
or calculate one aggregation for multiple columns:
cols = [c for c in df.columns if "datetime" not in c]
df.group_by_dynamic("datetime", every="10s").agg(
    pl.col(c).mean().alias(f"{c}_binmean")
    for c in cols
)
shape: (10, 3)
┌─────────────────────┬────────────┬────────────┐
│ datetime            ┆ v0_binmean ┆ v1_binmean │
│ ---                 ┆ ---        ┆ ---        │
│ datetime[μs]        ┆ f64        ┆ f64        │
╞═════════════════════╪════════════╪════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        │
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       │
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       │
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 34.0       │
│ …                   ┆ …          ┆ …          │
└─────────────────────┴────────────┴────────────┘
However, combining both approaches fails!
df.group_by_dynamic("datetime", every="10s").agg(
    [
        pl.col(c).mean().alias(f"{c}_binmean"),
        pl.col(c).count().alias(f"{c}_bincount")
    ]
    for c in cols
)
DuplicateError: column with name 'literal' has more than one occurrences
Is there a "polarustic" way to calculate multiple statistical parameters for multiple (all) columns of the dataframe in one go?
Related, pandas-specific: Python pandas groupby aggregate on multiple columns
There are various ways of selecting multiple columns "at once" in polars:
df.select(pl.all()).columns               # all columns
# ['v0', 'v1', 'datetime']
df.select(pl.col("v0", "v1")).columns     # by name(s)
# ['v0', 'v1']
df.select(pl.exclude("datetime")).columns # by exclusion
# ['v0', 'v1']
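Recent polars versions also ship a dedicated selectors module for this kind of multi-column selection. As a small aside (this example is not from the original post; cs is the import alias used in the polars docs):

import polars.selectors as cs

df.select(cs.numeric()).columns           # by dtype
# ['v0', 'v1']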
The output column names can be controlled using the .name.* methods, e.g. .name.suffix():
df.select(pl.exclude("datetime").mean().name.suffix("_binmean"))
shape: (1, 2)
┌────────────┬────────────┐
│ v0_binmean ┆ v1_binmean │
│ ---        ┆ ---        │
│ f64        ┆ f64        │
╞════════════╪════════════╡
│ 49.0       ┆ 49.0       │
└────────────┴────────────┘
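Where a fixed suffix is not flexible enough, .name.map() takes an arbitrary renaming function; a minimal sketch using the same df:

df.select(pl.exclude("datetime").mean().name.map(lambda c: f"mean_of_{c}")).columns
# ['mean_of_v0', 'mean_of_v1']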
Your combined attempt fails because the generator yields plain Python lists: each list is parsed as a single literal input (named 'literal') instead of being unpacked into separate expressions, so the second 'literal' collides with the first. Multi-column expressions avoid the Python loop entirely, and we can rewrite your example as:
df.group_by_dynamic("datetime", every="10s").agg(
    pl.exclude("datetime").mean().name.suffix("_binmean"),
    pl.exclude("datetime").count().name.suffix("_bincount")
)
shape: (10, 5)
┌─────────────────────┬────────────┬────────────┬─────────────┬─────────────┐
│ datetime            ┆ v0_binmean ┆ v1_binmean ┆ v0_bincount ┆ v1_bincount │
│ ---                 ┆ ---        ┆ ---        ┆ ---         ┆ ---         │
│ datetime[μs]        ┆ f64        ┆ f64        ┆ u32         ┆ u32         │
╞═════════════════════╪════════════╪════════════╪═════════════╪═════════════╡
│ 2022-10-30 00:00:00 ┆ 4.0        ┆ 4.0        ┆ 5           ┆ 5           │
│ 2022-10-30 00:00:10 ┆ 14.0       ┆ 14.0       ┆ 5           ┆ 5           │
│ 2022-10-30 00:00:20 ┆ 24.0       ┆ 24.0       ┆ 5           ┆ 5           │
│ 2022-10-30 00:00:30 ┆ 34.0       ┆ 34.0       ┆ 5           ┆ 5           │
│ 2022-10-30 00:00:40 ┆ 44.0       ┆ 44.0       ┆ 5           ┆ 5           │
│ 2022-10-30 00:00:50 ┆ 54.0       ┆ 54.0       ┆ 5           ┆ 5           │
│ 2022-10-30 00:01:00 ┆ 64.0       ┆ 64.0       ┆ 5           ┆ 5           │
│ 2022-10-30 00:01:10 ┆ 74.0       ┆ 74.0       ┆ 5           ┆ 5           │
│ 2022-10-30 00:01:20 ┆ 84.0       ┆ 84.0       ┆ 5           ┆ 5           │
│ 2022-10-30 00:01:30 ┆ 94.0       ┆ 94.0       ┆ 5           ┆ 5           │
└─────────────────────┴────────────┴────────────┴─────────────┴─────────────┘
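If you later need more statistics per bin, one way to keep this readable (a sketch, not part of the original answer) is to collect the aggregation expressions in a plain list first, since .agg() also accepts an iterable of expressions:

# build the aggregations programmatically; each extra statistic is one line
aggs = [
    pl.exclude("datetime").mean().name.suffix("_binmean"),
    pl.exclude("datetime").count().name.suffix("_bincount"),
    pl.exclude("datetime").min().name.suffix("_binmin"),
    pl.exclude("datetime").max().name.suffix("_binmax"),
]
df.group_by_dynamic("datetime", every="10s").agg(aggs)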