pythonduckdb

Minimum periods in rolling mean


Say I have:

data = {
    'id': ['a', 'a', 'a', 'b', 'b', 'b', 'b'],
    'd': [1,2,3,0,1,2,3],
    'sales': [5,1,3,4,1,2,3],
}

I would like to add a column with a rolling mean with window size 2, with min_periods=2, over 'id'

In Polars, I can do:

import polars as pl

df = pl.DataFrame(data)
df.with_columns(sales_rolling = pl.col('sales').rolling_mean(2).over('id'))
shape: (7, 4)
┌─────┬─────┬───────┬───────────────┐
│ id  ┆ d   ┆ sales ┆ sales_rolling │
│ --- ┆ --- ┆ ---   ┆ ---           │
│ str ┆ i64 ┆ i64   ┆ f64           │
╞═════╪═════╪═══════╪═══════════════╡
│ a   ┆ 1   ┆ 5     ┆ null          │
│ a   ┆ 2   ┆ 1     ┆ 3.0           │
│ a   ┆ 3   ┆ 3     ┆ 2.0           │
│ b   ┆ 0   ┆ 4     ┆ null          │
│ b   ┆ 1   ┆ 1     ┆ 2.5           │
│ b   ┆ 2   ┆ 2     ┆ 1.5           │
│ b   ┆ 3   ┆ 3     ┆ 2.5           │
└─────┴─────┴───────┴───────────────┘

What's the DuckDB equivalent? I've tried

import duckdb

duckdb.sql("""
    select
        *,
        mean(sales) over (
            partition by id 
            order by d
            range between 1 preceding and 0 following
        ) as sales_rolling 
    from df
""").sort('id', 'd')

but get

┌─────────┬───────┬───────┬───────────────┐
│   id    │   d   │ sales │ sales_rolling │
│ varchar │ int64 │ int64 │    double     │
├─────────┼───────┼───────┼───────────────┤
│ a       │     1 │     5 │           5.0 │
│ a       │     2 │     1 │           3.0 │
│ a       │     3 │     3 │           2.0 │
│ b       │     0 │     4 │           4.0 │
│ b       │     1 │     1 │           2.5 │
│ b       │     2 │     2 │           1.5 │
│ b       │     3 │     3 │           2.5 │
└─────────┴───────┴───────┴───────────────┘

This is very close, but duckdb still calculates the rolling mean when there's only a single value in the window. How can I replicate the min_periods=2 (default) behaviour from Polars?


Solution

  • You can use case statement and count:

    duckdb.sql("""
        from df
        select
            *,
            case
                when count(*) over rolling2 = 2 then 
                    mean(sales) over rolling2
            end as sales_rolling
        window rolling2 as (
            partition by id 
            order by d
            rows between 1 preceding and current row
        )   
    """).sort('id', 'd')
    
    ┌─────────┬───────┬───────┬───────────────┐
    │   id    │   d   │ sales │ sales_rolling │
    │ varchar │ int64 │ int64 │    double     │
    ├─────────┼───────┼───────┼───────────────┤
    │ a       │     1 │     5 │          NULL │
    │ a       │     2 │     1 │           3.0 │
    │ a       │     3 │     3 │           2.0 │
    │ b       │     0 │     4 │          NULL │
    │ b       │     1 │     1 │           2.5 │
    │ b       │     2 │     2 │           1.5 │
    │ b       │     3 │     3 │           2.5 │
    └─────────┴───────┴───────┴───────────────┘
    

    Note I've used named window and row framing here.