Polars: cumulative set of values instead of a cumulative sum


I'd like to use a function like cum_sum that, instead of summing the values, builds the set of all values seen in the column up to and including each row.

import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4]})
df["a"].cum_sum()
shape: (4,)
Series: 'a' [i64]
[
    1
    3
    6
    10
]

but I'd like to have something like

df["a"].cum_sum()
shape: (4,)
Series: 'a' [i64]
[
    {1}
    {1, 2}
    {1, 2, 3}
    {1, 2, 3, 4}
]

Also note that I'm working with large DataFrames (several million rows), so I'd like to avoid row-by-row indexing and map_elements, which I've read slow things down considerably.


Solution

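  • This can be achieved using pl.Expr.cumulative_eval together with pl.Expr.unique and pl.Expr.implode as follows. Polars has no set dtype, so the result is a list column holding the unique values seen so far. Note that the Polars documentation warns that cumulative_eval can have O(n^2) complexity, so it is worth benchmarking on a sample before running it over millions of rows.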

    df.with_columns(
        res=pl.col("a").cumulative_eval(pl.element().unique().implode())
    )
    
    shape: (4, 2)
    ┌─────┬─────────────┐
    │ a   ┆ res         │
    │ --- ┆ ---         │
    │ i64 ┆ list[i64]   │
    ╞═════╪═════════════╡
    │ 1   ┆ [1]         │
    │ 2   ┆ [1, 2]      │
    │ 3   ┆ [1, 2, 3]   │
    │ 4   ┆ [1, 2, … 4] │
    └─────┴─────────────┘
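
    To sanity-check the result on a small sample, it can help to have a plain-Python reference for what "cumulative set" means. The sketch below is only an illustration and not part of the Polars API: the helper name cumulative_sets is hypothetical, and it uses itertools.accumulate from the standard library. It is not meant for millions of rows.

```python
from itertools import accumulate


def cumulative_sets(values):
    """Return, for each position, the set of all values seen so far.

    Hypothetical helper for checking small samples; not a Polars function.
    """
    # accumulate threads a running set through the sequence. `seen | {v}`
    # builds a new set at every step, so earlier results are never mutated.
    running = accumulate(values, lambda seen, v: seen | {v}, initial=set())
    # Drop the first element, which is the empty initial set.
    return list(running)[1:]


expected = cumulative_sets([1, 2, 3, 4])
print(expected)
```

    Assuming the output of the Polars expression above is stored in a variable such as out, each row of out["res"].to_list() can be converted with set(...) and compared against the matching entry of expected. Comparing as sets avoids depending on the order in which unique returns values.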