I want to compute the cumulative mean & std on a polars dataframe column.
For the mean
I tried this:
import polars as pl
df = pl.DataFrame({
'value': [4, 6, 8, 11, 5, 6, 8, 15],
'class': ['A', 'A', 'B', 'A', 'B', 'A', 'B', 'B']
})
df.with_columns(cum_mean=pl.col('value').cum_sum().over('class')
/ pl.int_range(pl.len()).add(1).over('class'))
which correctly gives
shape: (8, 3)
┌───────┬───────┬──────────┐
│ value ┆ class ┆ cum_mean │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 │
╞═══════╪═══════╪══════════╡
│ 4 ┆ A ┆ 4.0 │
│ 6 ┆ A ┆ 5.0 │
│ 8 ┆ B ┆ 8.0 │
│ 11 ┆ A ┆ 7.0 │
│ 5 ┆ B ┆ 6.5 │
│ 6 ┆ A ┆ 6.75 │
│ 8 ┆ B ┆ 7.0 │
│ 15 ┆ B ┆ 9.0 │
└───────┴───────┴──────────┘
However, this seems very clunky, and becomes a little more complicated (and possibly error-prone) for std
.
Is there a nicer (possibly built-in) version for computing the cum mean & cum std?
I might have a solution which is more clean. You can get to it using rolling-functions like rolling_mean
or rolling_std
. Here is my proposal:
df.with_columns(
cum_mean=pl.col('value').cum_sum().over('class')/pl.col('value').cum_count().over('class'),
cum_mean_by_rolling=pl.col('value').rolling_mean(window_size=df.shape[0], min_samples=1).over('class'),
cum_std_by_rolling=pl.col('value').rolling_std(window_size=df.shape[0], min_samples=1).over('class')
)
If you define the window size as the number of rows in the data frame (df.shape[0]
) and the minimum number of samples as 1, then you can get the wanted result.
I also changed your implementation for the cum_mean a bit so that it is a bit shorter. If I run the code I get this result.
shape: (8, 5)
┌───────┬───────┬──────────┬─────────────────────┬────────────────────┐
│ value ┆ class ┆ cum_mean ┆ cum_mean_by_rolling ┆ cum_std_by_rolling │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ f64 ┆ f64 ┆ f64 │
╞═══════╪═══════╪══════════╪═════════════════════╪════════════════════╡
│ 4 ┆ A ┆ 4.0 ┆ 4.0 ┆ null │
│ 6 ┆ A ┆ 5.0 ┆ 5.0 ┆ 1.414214 │
│ 8 ┆ B ┆ 8.0 ┆ 8.0 ┆ null │
│ 11 ┆ A ┆ 7.0 ┆ 7.0 ┆ 3.605551 │
│ 5 ┆ B ┆ 6.5 ┆ 6.5 ┆ 2.12132 │
│ 6 ┆ A ┆ 6.75 ┆ 6.75 ┆ 2.986079 │
│ 8 ┆ B ┆ 7.0 ┆ 7.0 ┆ 1.732051 │
│ 15 ┆ B ┆ 9.0 ┆ 9.0 ┆ 4.242641 │
└───────┴───────┴──────────┴─────────────────────┴────────────────────┘
I did not find a more suitable build in function.
Hope this helps.