I'm trying to write a script in Polars that would flatten a list of prices per date and minute. The catch is I want to incrementally aggregate into columns and zero out values in the future. For example. Idea is to make this solution vectorized if possible to make it performant.
df = pl.DataFrame({
"date": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02",
"2022-01-02", "2022-01-03", "2022-01-03", "2022-01-03"],
"minute": [1, 2, 1, 2, 3, 1, 2, 3],
"price": [10, 20, 15, 10, 20, 30, 60, 70]
})
Should build the following dataframe.
shape: (8, 5)
┌────────────┬────────┬─────────┬─────────┬───────────┐
│ date ┆ minute ┆ 1_price ┆ 2_price ┆ 3_price │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪════════╪═════════╪═════════╪═══════════╡
│ 2022-01-01 ┆ 1 ┆ 10 ┆ 0 ┆ 0 │
│ 2022-01-01 ┆ 2 ┆ 10 ┆ 20 ┆ 0 │
│ 2022-01-02 ┆ 1 ┆ 15 ┆ 0 ┆ 0 │
│ 2022-01-02 ┆ 2 ┆ 15 ┆ 10 ┆ 0 │
│ 2022-01-02 ┆ 3 ┆ 15 ┆ 10 ┆ 20 │
│ 2022-01-03 ┆ 1 ┆ 30 ┆ 0 ┆ 0 │
│ 2022-01-03 ┆ 2 ┆ 30 ┆ 60 ┆ 0 │
│ 2022-01-03 ┆ 3 ┆ 30 ┆ 60 ┆ 70 │
└────────────┴────────┴─────────┴─────────┴───────────┘
This seems to work
df.join(
df.pivot('minute', index='date'),
on='date') \
.select("date", "minute",
**{f"{x}_price":pl.when(pl.lit(x)<=pl.col('minute'))
.then(pl.col(f"{x}"))
.otherwise(0)
for x in df['minute'].unique().sort()})
shape: (8, 5)
┌────────────┬────────┬─────────┬─────────┬─────────┐
│ date ┆ minute ┆ 1_price ┆ 2_price ┆ 3_price │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞════════════╪════════╪═════════╪═════════╪═════════╡
│ 2022-01-01 ┆ 1 ┆ 10 ┆ 0 ┆ 0 │
│ 2022-01-01 ┆ 2 ┆ 10 ┆ 20 ┆ 0 │
│ 2022-01-02 ┆ 1 ┆ 15 ┆ 0 ┆ 0 │
│ 2022-01-02 ┆ 2 ┆ 15 ┆ 10 ┆ 0 │
│ 2022-01-02 ┆ 3 ┆ 15 ┆ 10 ┆ 20 │
│ 2022-01-03 ┆ 1 ┆ 30 ┆ 0 ┆ 0 │
│ 2022-01-03 ┆ 2 ┆ 30 ┆ 60 ┆ 0 │
│ 2022-01-03 ┆ 3 ┆ 30 ┆ 60 ┆ 70 │
└────────────┴────────┴─────────┴─────────┴─────────┘
What your example shows is more than just flattening but it seems what you're looking for is that the x_price column should be 0 whenever the x is greater than whatever is in the minute column. That's what this does through a pivot, self join, and when/then/otherwise.