pythonpython-3.xpython-polarscumulative-sum

Compute cumulative mean & std on polars dataframe (using over)


I want to compute the cumulative mean & std on a polars dataframe column.

For the mean I tried this:

import polars as pl

df = pl.DataFrame({
    'value': [4, 6, 8, 11, 5, 6, 8, 15],
    'class': ['A', 'A', 'B', 'A', 'B', 'A', 'B', 'B']
})

df.with_columns(cum_mean=pl.col('value').cum_sum().over('class') 
                       / pl.int_range(pl.len()).add(1).over('class'))

which correctly gives

shape: (8, 3)
┌───────┬───────┬──────────┐
│ value ┆ class ┆ cum_mean │
│ ---   ┆ ---   ┆ ---      │
│ i64   ┆ str   ┆ f64      │
╞═══════╪═══════╪══════════╡
│ 4     ┆ A     ┆ 4.0      │
│ 6     ┆ A     ┆ 5.0      │
│ 8     ┆ B     ┆ 8.0      │
│ 11    ┆ A     ┆ 7.0      │
│ 5     ┆ B     ┆ 6.5      │
│ 6     ┆ A     ┆ 6.75     │
│ 8     ┆ B     ┆ 7.0      │
│ 15    ┆ B     ┆ 9.0      │
└───────┴───────┴──────────┘

However, this seems very clunky, and becomes a little more complicated (and possibly error-prone) for std.

Is there a nicer (possibly built-in) version for computing the cum mean & cum std?


Solution

  • I might have a solution which is more clean. You can get to it using rolling-functions like rolling_mean or rolling_std. Here is my proposal:

    df.with_columns(
        cum_mean=pl.col('value').cum_sum().over('class')/pl.col('value').cum_count().over('class'),
        cum_mean_by_rolling=pl.col('value').rolling_mean(window_size=df.shape[0], min_samples=1).over('class'),
        cum_std_by_rolling=pl.col('value').rolling_std(window_size=df.shape[0], min_samples=1).over('class')
    )
    

    If you define the window size as the number of rows in the data frame (df.shape[0]) and the minimum number of samples as 1, then you can get the wanted result. I also changed your implementation for the cum_mean a bit so that it is a bit shorter. If I run the code I get this result.

    shape: (8, 5)
    ┌───────┬───────┬──────────┬─────────────────────┬────────────────────┐
    │ value ┆ class ┆ cum_mean ┆ cum_mean_by_rolling ┆ cum_std_by_rolling │
    │ ---   ┆ ---   ┆ ---      ┆ ---                 ┆ ---                │
    │ i64   ┆ str   ┆ f64      ┆ f64                 ┆ f64                │
    ╞═══════╪═══════╪══════════╪═════════════════════╪════════════════════╡
    │ 4     ┆ A     ┆ 4.0      ┆ 4.0                 ┆ null               │
    │ 6     ┆ A     ┆ 5.0      ┆ 5.0                 ┆ 1.414214           │
    │ 8     ┆ B     ┆ 8.0      ┆ 8.0                 ┆ null               │
    │ 11    ┆ A     ┆ 7.0      ┆ 7.0                 ┆ 3.605551           │
    │ 5     ┆ B     ┆ 6.5      ┆ 6.5                 ┆ 2.12132            │
    │ 6     ┆ A     ┆ 6.75     ┆ 6.75                ┆ 2.986079           │
    │ 8     ┆ B     ┆ 7.0      ┆ 7.0                 ┆ 1.732051           │
    │ 15    ┆ B     ┆ 9.0      ┆ 9.0                 ┆ 4.242641           │
    └───────┴───────┴──────────┴─────────────────────┴────────────────────┘
    
    

    I did not find a more suitable build in function.

    Hope this helps.