pythonduckdb

Rolling sum using DuckDB's Python relational API


Say I have

data = {'id': [1, 1, 1, 2, 2, 2],
 'd': [1, 2, 3, 1, 2, 3],
 'sales': [1, 4, 2, 3, 1, 2]}

I want to compute a rolling sum with window of 2 partitioned by 'id' ordered by 'd'

Using SQL I can do:

duckdb.sql("""
select *, sum(sales) over w as rolling_sales
from df
window w as (partition by id order by d rows between 1 preceding and current row)
""")
Out[21]:
┌───────┬───────┬───────┬───────────────┐
│  id   │   d   │ sales │ rolling_sales │
│ int64 │ int64 │ int64 │    int128     │
├───────┼───────┼───────┼───────────────┤
│     1 │     1 │     1 │             1 │
│     1 │     2 │     4 │             5 │
│     1 │     3 │     2 │             6 │
│     2 │     1 │     3 │             3 │
│     2 │     2 │     1 │             4 │
│     2 │     3 │     2 │             3 │
└───────┴───────┴───────┴───────────────┘

This works great, but how can I do it using the Python Relational API?

I've got as far as

rel = duckdb.sql('select * from df')
rel.sum(
    'sales',
    projected_columns='*',
    window_spec='over (partition by id order by d rows between 1 preceding and current row)'
)

which gives

┌───────────────────────────────────────────────────────────────────────────────────────┐
│ sum(sales) OVER (PARTITION BY id ORDER BY d ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) │
│                                        int128                                         │
├───────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                     3 │
│                                                                                     4 │
│                                                                                     3 │
│                                                                                     1 │
│                                                                                     5 │
│                                                                                     6 │
└───────────────────────────────────────────────────────────────────────────────────────┘

This is close, but it's not quite right - how do I get the name of the last column to be rolling_sales?


Solution

  • I'm not an expert in DuckDB relational API but this works:

    rel.sum(
        'sales',
        projected_columns='*',
        window_spec='over (partition by id order by d rows between 1 preceding and current row) as rolling_sales'
    )
    
    ┌───────┬───────┬───────┬───────────────┐
    │  id   │   d   │ sales │ rolling_sales │
    │ int64 │ int64 │ int64 │    int128     │
    ├───────┼───────┼───────┼───────────────┤
    │     1 │     1 │     1 │             1 │
    │     1 │     2 │     4 │             5 │
    │     1 │     3 │     2 │             6 │
    │     2 │     1 │     3 │             3 │
    │     2 │     2 │     1 │             4 │
    │     2 │     3 │     2 │             3 │
    └───────┴───────┴───────┴───────────────┘