I'd like to use a function like cum_sum, but one that collects the set of all values seen in the column up to each row instead of summing them.
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4]})
df["a"].cum_sum()
shape: (4,)
Series: 'a' [i64]
[
1
3
6
10
]
But I'd like to get something like this:
df["a"].cum_sum()
shape: (4,)
Series: 'a' [i64]
[
{1}
{1, 2}
{1, 2, 3}
{1, 2, 3, 4}
]
Also note that I'm working with big DataFrames (several million rows), so I'd like to avoid indexing and map_elements, which I've read slow things down a lot.
This can be achieved using pl.Expr.cumulative_eval together with pl.Expr.unique and pl.Expr.implode, as follows.
df.with_columns(
    res=pl.col("a").cumulative_eval(pl.element().unique().implode())
)
shape: (4, 2)
┌─────┬─────────────┐
│ a   ┆ res         │
│ --- ┆ ---         │
│ i64 ┆ list[i64]   │
╞═════╪═════════════╡
│ 1   ┆ [1]         │
│ 2   ┆ [1, 2]      │
│ 3   ┆ [1, 2, 3]   │
│ 4   ┆ [1, 2, … 4] │
└─────┴─────────────┘
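To sanity-check the result on a small sample, it can help to compare against a plain-Python reference. The sketch below is only an illustration of what the expression computes, not a replacement for it on millions of rows. The helper name cumulative_unique is hypothetical, and returning each prefix sorted is an assumption made for readability, since unique() on its own does not promise any particular order.

```python
def cumulative_unique(values):
    """For each position, return the distinct values seen so far.

    Hypothetical reference helper: plain Python, meant for checking
    small samples only. Each prefix is returned sorted (an assumption
    made here so that results are easy to compare).
    """
    seen = set()
    out = []
    for v in values:
        seen.add(v)               # a set ignores repeated values
        out.append(sorted(seen))  # snapshot of the distinct values so far
    return out


result = cumulative_unique([1, 2, 2, 4])
print(result)  # [[1], [1, 2], [1, 2], [1, 2, 4]]
```

Note that the duplicate 2 does not grow the list, which is the difference from simply collecting every prefix of the column.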