Consider this dummy dataset,
import pandas as pd
import numpy as np
np.random.seed(44)
num_rows = 8
data = {
'item_id': np.random.choice(['A', 'B'], num_rows),
'store_id': np.random.choice([1, 2], num_rows),
'sold_quantity': np.random.randint(0, 5, num_rows),
'total_sku_count': np.random.choice([0, 1], num_rows)
}
df = pd.DataFrame(data)
item_id store_id sold_quantity total_sku_count
0 A 2 3 1
1 B 2 3 1
2 B 1 3 1
3 B 1 4 0
4 B 2 1 0
5 B 2 1 0
6 A 1 0 1
7 A 2 4 1
I can calculate subset-wise cumulative sum like this in pandas,
subset = ["item_id",'store_id']
df['cum_count'] = df.groupby(subset).cumcount()+1
item_id store_id sold_quantity total_sku_count count cum_count
0 A 2 3 1 1 1
1 B 2 3 1 1 1
2 B 1 3 1 1 1
3 B 1 4 0 2 2
4 B 2 1 0 2 2
5 B 2 1 0 3 3
6 A 1 0 1 1 1
7 A 2 4 1 2 2
I'm trying to implement the same in Polars. Since I have relatively less experience with Polars, I'm having a hard time doing this. I've tried something like this, but it didn't work as expected,
subset = ["item_id",'store_id']
df = df.with_columns((pl.struct(subset).over(subset).cum_count()).alias("cum_counts"))
Please help me if there is a way to achieve this.Your support is much appreciated.
The position of the .over()
is important.
You want the .cum_count()
for each group - so it must come before the .over()
df.with_columns(
pl.struct(subset).cum_count().over(subset).alias("cum_counts")
)
shape: (8, 5)
┌─────────┬──────────┬───────────────┬─────────────────┬────────────┐
│ item_id ┆ store_id ┆ sold_quantity ┆ total_sku_count ┆ cum_counts │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ i64 ┆ u32 │
╞═════════╪══════════╪═══════════════╪═════════════════╪════════════╡
│ A ┆ 2 ┆ 3 ┆ 1 ┆ 0 │
│ B ┆ 2 ┆ 3 ┆ 1 ┆ 0 │
│ B ┆ 1 ┆ 3 ┆ 1 ┆ 0 │
│ B ┆ 1 ┆ 4 ┆ 0 ┆ 1 │
│ B ┆ 2 ┆ 1 ┆ 0 ┆ 1 │
│ B ┆ 2 ┆ 1 ┆ 0 ┆ 2 │
│ A ┆ 1 ┆ 0 ┆ 1 ┆ 0 │
│ A ┆ 2 ┆ 4 ┆ 1 ┆ 1 │
└─────────┴──────────┴───────────────┴─────────────────┴────────────┘