Here's some sample data
import polars as pl
df = pl.DataFrame(
{
"date": [
"2024-08-01",
"2024-08-02",
"2024-08-03",
"2024-08-04",
"2024-08-04",
"2024-08-05",
"2024-08-06",
"2024-08-08",
"2024-08-09",
],
"type": ["A", "A", "A", "A", "B", "B", "B", "A", "A"],
}
).with_columns(pl.col("date").str.to_date())
And my desired output would look something like this
shape: (9, 3)
┌────────────┬──────┬───────────────┐
│ date ┆ type ┆ days_in_a_row │
│ --- ┆ --- ┆ --- │
│ date ┆ str ┆ i64 │
╞════════════╪══════╪═══════════════╡
│ 2024-08-01 ┆ A ┆ 1 │
│ 2024-08-02 ┆ A ┆ 2 │
│ 2024-08-03 ┆ A ┆ 3 │
│ 2024-08-04 ┆ A ┆ 4 │
│ 2024-08-04 ┆ B ┆ 1 │
│ 2024-08-05 ┆ B ┆ 2 │
│ 2024-08-06 ┆ B ┆ 3 │
│ 2024-08-08 ┆ A ┆ 1 │
│ 2024-08-09 ┆ A ┆ 2 │
└────────────┴──────┴───────────────┘
Where my days_in_a_row
counter gets reset upon a date gap greater than 1 day.
What I've tried so far
df.with_columns(days_in_a_row=pl.cum_count("date").over("type"))
Which gives me
shape: (9, 3)
┌────────────┬──────┬───────────────┐
│ date ┆ type ┆ days_in_a_row │
│ --- ┆ --- ┆ --- │
│ date ┆ str ┆ u32 │
╞════════════╪══════╪═══════════════╡
│ 2024-08-01 ┆ A ┆ 1 │
│ 2024-08-02 ┆ A ┆ 2 │
│ 2024-08-03 ┆ A ┆ 3 │
│ 2024-08-04 ┆ A ┆ 4 │
│ 2024-08-04 ┆ B ┆ 1 │
│ 2024-08-05 ┆ B ┆ 2 │
│ 2024-08-06 ┆ B ┆ 3 │
│ 2024-08-08 ┆ A ┆ 5 │
│ 2024-08-09 ┆ A ┆ 6 │
└────────────┴──────┴───────────────┘
Which is not resetting after the gap. I can't quite nail this one down.
I've also tried variations with
df
.with_columns(date_gap=pl.col("date").diff().over("type"))
.with_columns(days_in_a_row=(pl.cum_count("date").over("date_gap", "type")))
Which get's closer, but it still ends up not resetting where I'd want it
shape: (9, 4)
┌────────────┬──────┬──────────────┬───────────────┐
│ date ┆ type ┆ date_gap ┆ days_in_a_row │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ duration[ms] ┆ u32 │
╞════════════╪══════╪══════════════╪═══════════════╡
│ 2024-08-01 ┆ A ┆ null ┆ 1 │
│ 2024-08-02 ┆ A ┆ 1d ┆ 1 │
│ 2024-08-03 ┆ A ┆ 1d ┆ 2 │
│ 2024-08-04 ┆ A ┆ 1d ┆ 3 │
│ 2024-08-04 ┆ B ┆ null ┆ 1 │
│ 2024-08-05 ┆ B ┆ 1d ┆ 1 │
│ 2024-08-06 ┆ B ┆ 1d ┆ 2 │
│ 2024-08-08 ┆ A ┆ 4d ┆ 1 │
│ 2024-08-09 ┆ A ┆ 1d ┆ 4 │
└────────────┴──────┴──────────────┴───────────────┘
What about first creating a secondary grouper based on the duration ≥ 1day?
from datetime import timedelta
(df.with_columns(group=pl.col("date").diff().gt(timedelta(days=1))
.fill_null(True).cum_sum().over("type"))
.with_columns(days_in_a_row=pl.cum_count("date").over(["type", "group"]))
)
Output:
┌────────────┬──────┬───────┬───────────────┐
│ date ┆ type ┆ group ┆ days_in_a_row │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ str ┆ u32 ┆ u32 │
╞════════════╪══════╪═══════╪═══════════════╡
│ 2024-08-01 ┆ A ┆ 1 ┆ 1 │
│ 2024-08-02 ┆ A ┆ 1 ┆ 2 │
│ 2024-08-03 ┆ A ┆ 1 ┆ 3 │
│ 2024-08-04 ┆ A ┆ 1 ┆ 4 │
│ 2024-08-04 ┆ B ┆ 1 ┆ 1 │
│ 2024-08-05 ┆ B ┆ 1 ┆ 2 │
│ 2024-08-06 ┆ B ┆ 1 ┆ 3 │
│ 2024-08-08 ┆ A ┆ 2 ┆ 1 │
│ 2024-08-09 ┆ A ┆ 2 ┆ 2 │
└────────────┴──────┴───────┴───────────────┘