pythondataframeetlpython-polarsrust-polars

Polars - Flatten rows into columns, aggregating by column values


I'm trying to write a script in Polars that would flatten a list of prices per date and minute. The catch is I want to incrementally aggregate into columns and zero out values in the future. For example. Idea is to make this solution vectorized if possible to make it performant.

df = pl.DataFrame({
    "date": ["2022-01-01", "2022-01-01", "2022-01-02", "2022-01-02", 
             "2022-01-02", "2022-01-03", "2022-01-03", "2022-01-03"],
    "minute": [1, 2, 1, 2, 3, 1, 2, 3],
    "price": [10, 20, 15, 10, 20, 30, 60, 70]
})

Should build the following dataframe.

shape: (8, 5)
┌────────────┬────────┬─────────┬─────────┬───────────┐
│ date       ┆ minute ┆ 1_price ┆ 2_price ┆ 3_price   │
│ ---        ┆ ---    ┆ ---     ┆ ---     ┆ ---       │
│ str        ┆ i64    ┆ i64     ┆ i64     ┆ i64       │
╞════════════╪════════╪═════════╪═════════╪═══════════╡
│ 2022-01-01 ┆ 1      ┆ 10      ┆ 0       ┆ 0         │
│ 2022-01-01 ┆ 2      ┆ 10      ┆ 20      ┆ 0         │
│ 2022-01-02 ┆ 1      ┆ 15      ┆ 0       ┆ 0         │
│ 2022-01-02 ┆ 2      ┆ 15      ┆ 10      ┆ 0         │
│ 2022-01-02 ┆ 3      ┆ 15      ┆ 10      ┆ 20        │
│ 2022-01-03 ┆ 1      ┆ 30      ┆ 0       ┆ 0         │
│ 2022-01-03 ┆ 2      ┆ 30      ┆ 60      ┆ 0         │
│ 2022-01-03 ┆ 3      ┆ 30      ┆ 60      ┆ 70        │
└────────────┴────────┴─────────┴─────────┴───────────┘

Solution

  • This seems to work

    df.join(
        df.pivot('minute', index='date'),
        on='date') \
        .select("date", "minute",
                **{f"{x}_price":pl.when(pl.lit(x)<=pl.col('minute'))
                                  .then(pl.col(f"{x}"))
                                  .otherwise(0)
                                  for x in df['minute'].unique().sort()})
    shape: (8, 5)
    ┌────────────┬────────┬─────────┬─────────┬─────────┐
    │ date       ┆ minute ┆ 1_price ┆ 2_price ┆ 3_price │
    │ ---        ┆ ---    ┆ ---     ┆ ---     ┆ ---     │
    │ str        ┆ i64    ┆ i64     ┆ i64     ┆ i64     │
    ╞════════════╪════════╪═════════╪═════════╪═════════╡
    │ 2022-01-01 ┆ 1      ┆ 10      ┆ 0       ┆ 0       │
    │ 2022-01-01 ┆ 2      ┆ 10      ┆ 20      ┆ 0       │
    │ 2022-01-02 ┆ 1      ┆ 15      ┆ 0       ┆ 0       │
    │ 2022-01-02 ┆ 2      ┆ 15      ┆ 10      ┆ 0       │
    │ 2022-01-02 ┆ 3      ┆ 15      ┆ 10      ┆ 20      │
    │ 2022-01-03 ┆ 1      ┆ 30      ┆ 0       ┆ 0       │
    │ 2022-01-03 ┆ 2      ┆ 30      ┆ 60      ┆ 0       │
    │ 2022-01-03 ┆ 3      ┆ 30      ┆ 60      ┆ 70      │
    └────────────┴────────┴─────────┴─────────┴─────────┘
    
    

    What your example shows is more than just flattening but it seems what you're looking for is that the x_price column should be 0 whenever the x is greater than whatever is in the minute column. That's what this does through a pivot, self join, and when/then/otherwise.