pythonpostgresqlpython-polarsduckdb

rolling sum with right-closed interval in duckdb


In Polars / pandas I can do a rolling sum where row each row the window is (row - 10 minutes, row]. For example:

import polars as pl

data = {
    "timestamp": [
        "2023-08-04 10:00:00",
        "2023-08-04 10:05:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:10:00",
        "2023-08-04 10:20:00",
        "2023-08-04 10:20:00",
    ],
    "value": [1, 2, 3, 4, 5, 6],
}

df = pl.DataFrame(data).with_columns(pl.col("timestamp").str.strptime(pl.Datetime))

print(
    df.with_columns(pl.col("value").rolling_sum_by("timestamp", "10m", closed="right"))
)

This outputs

shape: (6, 2)
┌─────────────────────┬───────┐
│ timestamp           ┆ value │
│ ---                 ┆ ---   │
│ datetime[μs]        ┆ i64   │
╞═════════════════════╪═══════╡
│ 2023-08-04 10:00:00 ┆ 1     │
│ 2023-08-04 10:05:00 ┆ 3     │
│ 2023-08-04 10:10:00 ┆ 9     │
│ 2023-08-04 10:10:00 ┆ 9     │
│ 2023-08-04 10:20:00 ┆ 11    │
│ 2023-08-04 10:20:00 ┆ 11    │
└─────────────────────┴───────┘

How can I do this in DuckDB? Closest I could come up with is:

rel = duckdb.sql("""
SELECT
  timestamp,
  value,
  SUM(value) OVER roll AS rolling_sum
FROM df
WINDOW roll AS (
  ORDER BY timestamp
  RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND CURRENT ROW
)
ORDER BY timestamp;
""")
print(rel)

but that makes the window [row - 10 minutes, row], not (row - 10 minutes, row]

Alternatively, I could do

rel = duckdb.sql("""
SELECT
  timestamp,
  value,
  SUM(value) OVER roll AS rolling_sum
FROM df
WINDOW roll AS (
  ORDER BY timestamp
  RANGE BETWEEN INTERVAL '10 minutes' - INTERVAL '1 microsecond' PRECEDING AND CURRENT ROW
)
ORDER BY timestamp;
""")

but I'm not sure about how robust that'd be?


Solution

  • Maybe not particularly neat, but from the top of my head you could exclude the rows which are exactly 10 minutes back by additional window clause

    import duckdb
    
    rel = duckdb.sql("""
    SELECT
      timestamp,
      value,
      SUM(value) OVER roll - coalesce(SUM(value) OVER exclude, 0) AS rolling_sum
    FROM df
    WINDOW roll AS (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND CURRENT ROW
    ), exclude AS (
      ORDER BY timestamp
      RANGE BETWEEN INTERVAL 10 minutes PRECEDING AND INTERVAL 10 minutes PRECEDING
    )
    ORDER BY timestamp;
    """)
    print(rel)
    
    ┌─────────────────────┬───────┬─────────────┐
    │      timestamp      │ value │ rolling_sum │
    │      timestamp      │ int64 │   int128    │
    ├─────────────────────┼───────┼─────────────┤
    │ 2023-08-04 10:00:00 │     1 │           1 │
    │ 2023-08-04 10:05:00 │     2 │           3 │
    │ 2023-08-04 10:10:00 │     3 │           9 │
    │ 2023-08-04 10:10:00 │     4 │           9 │
    │ 2023-08-04 10:20:00 │     5 │          11 │
    │ 2023-08-04 10:20:00 │     6 │          11 │
    └─────────────────────┴───────┴─────────────┘