Say I have
data = {'id': [1, 1, 1, 2, 2, 2],
'd': [1, 2, 3, 1, 2, 3],
'sales': [1, 4, 2, 3, 1, 2]}
I want to compute a rolling sum with window of 2 partitioned by 'id' ordered by 'd'
Using SQL I can do:
duckdb.sql("""
select *, sum(sales) over w as rolling_sales
from df
window w as (partition by id order by d rows between 1 preceding and current row)
""")
Out[21]:
┌───────┬───────┬───────┬───────────────┐
│ id │ d │ sales │ rolling_sales │
│ int64 │ int64 │ int64 │ int128 │
├───────┼───────┼───────┼───────────────┤
│ 1 │ 1 │ 1 │ 1 │
│ 1 │ 2 │ 4 │ 5 │
│ 1 │ 3 │ 2 │ 6 │
│ 2 │ 1 │ 3 │ 3 │
│ 2 │ 2 │ 1 │ 4 │
│ 2 │ 3 │ 2 │ 3 │
└───────┴───────┴───────┴───────────────┘
This works great, but how can I do it using the Python Relational API?
I've got as far as
rel = duckdb.sql('select * from df')
rel.sum(
'sales',
projected_columns='*',
window_spec='over (partition by id order by d rows between 1 preceding and current row)'
)
which gives
┌───────────────────────────────────────────────────────────────────────────────────────┐
│ sum(sales) OVER (PARTITION BY id ORDER BY d ROWS BETWEEN 1 PRECEDING AND CURRENT ROW) │
│ int128 │
├───────────────────────────────────────────────────────────────────────────────────────┤
│ 3 │
│ 4 │
│ 3 │
│ 1 │
│ 5 │
│ 6 │
└───────────────────────────────────────────────────────────────────────────────────────┘
This is close, but it's not quite right - how do I get the name of the last column to be rolling_sales
?
I'm not an expert in DuckDB relational API but this works:
rel.sum(
'sales',
projected_columns='*',
window_spec='over (partition by id order by d rows between 1 preceding and current row) as rolling_sales'
)
┌───────┬───────┬───────┬───────────────┐
│ id │ d │ sales │ rolling_sales │
│ int64 │ int64 │ int64 │ int128 │
├───────┼───────┼───────┼───────────────┤
│ 1 │ 1 │ 1 │ 1 │
│ 1 │ 2 │ 4 │ 5 │
│ 1 │ 3 │ 2 │ 6 │
│ 2 │ 1 │ 3 │ 3 │
│ 2 │ 2 │ 1 │ 4 │
│ 2 │ 3 │ 2 │ 3 │
└───────┴───────┴───────┴───────────────┘