Let's say I have the following df:
import numpy as np
import polars as pl
df = pl.DataFrame({
"date": pl.date_range(pl.date(2022, 1, 1), pl.date(2022, 1, 30), interval="1d", eager=True),
"group": ["A", "B", "C"]*10,
"total": np.random.randint(100, size=30)
})
Now I want to sum the total per group over a rolling window of 5 days, something like combining group_by with rolling.
I wonder if there is a more elegant and efficient way to do it than the following code:
df_list = []
for table_group in df["group"].unique():
    temp_df = (
        df
        .filter(pl.col("group") == table_group)
        .sort("date")
        .rolling("date", period="5d")
        .agg(pl.col("total").sum().name.prefix("roll_"))
    )
    temp_df2 = (
        df
        .filter(pl.col("group") == table_group)
        .sort("date")
        .select(
            pl.all(),
            temp_df.get_column("roll_total")
        )
    )
    df_list.append(temp_df2)
pl.concat(df_list)
My main issue is that I have millions of rows and over 10,000 groups (and more than one column that I want to sum), so this code is not very elegant and consumes more memory than I would like.
It appears that the piece you're missing is the group_by parameter of rolling, which lets you specify the grouping directly instead of looping over each group yourself.
(df.rolling(index_column="date", group_by="group", period="5d")
   .agg(
       pl.col("total").last(),      # the current row's value (last in the window)
       roll_total=pl.sum("total")   # rolling 5-day sum per group
   )
)
shape: (30, 4)
┌───────┬────────────┬───────┬────────────┐
│ group ┆ date ┆ total ┆ roll_total │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ date ┆ i64 ┆ i64 │
╞═══════╪════════════╪═══════╪════════════╡
│ A ┆ 2022-01-01 ┆ 96 ┆ 96 │
│ A ┆ 2022-01-04 ┆ 16 ┆ 112 │
│ A ┆ 2022-01-07 ┆ 17 ┆ 33 │
│ A ┆ 2022-01-10 ┆ 4 ┆ 21 │
│ … ┆ … ┆ … ┆ … │
│ C ┆ 2022-01-21 ┆ 17 ┆ 78 │
│ C ┆ 2022-01-24 ┆ 84 ┆ 101 │
│ C ┆ 2022-01-27 ┆ 78 ┆ 162 │
│ C ┆ 2022-01-30 ┆ 54 ┆ 132 │
└───────┴────────────┴───────┴────────────┘
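Since you mention having more than one column to sum, here is a minimal sketch of the same call extended to several value columns. The second column, amount, is hypothetical and only added for illustration:

import numpy as np
import polars as pl

df = pl.DataFrame({
    "date": pl.date_range(pl.date(2022, 1, 1), pl.date(2022, 1, 30), interval="1d", eager=True),
    "group": ["A", "B", "C"] * 10,
    "total": np.random.randint(100, size=30),
    "amount": np.random.randint(100, size=30),  # hypothetical second value column
})

(df.rolling(index_column="date", group_by="group", period="5d")
   .agg(
       pl.col("total", "amount").last(),                      # current row's values
       pl.col("total", "amount").sum().name.prefix("roll_"),  # rolling 5-day sums per group
   )
)

A single rolling call replaces the per-group loop, so you no longer build and concatenate thousands of intermediate DataFrames. If memory is still a concern, the same query should also work lazily (df.lazy() ... .collect()), though that is untested here.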