pythonpython-polarscumulative-sum

Polars cumulative count over sequential dates


Here's some sample data

import polars as pl

df = pl.DataFrame(
    {
        "date": [
            "2024-08-01",
            "2024-08-02",
            "2024-08-03",
            "2024-08-04",
            "2024-08-04",
            "2024-08-05",
            "2024-08-06",
            "2024-08-08",
            "2024-08-09",
        ],
        "type": ["A", "A", "A", "A", "B", "B", "B", "A", "A"],
    }
).with_columns(pl.col("date").str.to_date())

And my desired output would look something like this

shape: (9, 3)
┌────────────┬──────┬───────────────┐
│ date       ┆ type ┆ days_in_a_row │
│ ---        ┆ ---  ┆ ---           │
│ date       ┆ str  ┆ i64           │
╞════════════╪══════╪═══════════════╡
│ 2024-08-01 ┆ A    ┆ 1             │
│ 2024-08-02 ┆ A    ┆ 2             │
│ 2024-08-03 ┆ A    ┆ 3             │
│ 2024-08-04 ┆ A    ┆ 4             │
│ 2024-08-04 ┆ B    ┆ 1             │
│ 2024-08-05 ┆ B    ┆ 2             │
│ 2024-08-06 ┆ B    ┆ 3             │
│ 2024-08-08 ┆ A    ┆ 1             │
│ 2024-08-09 ┆ A    ┆ 2             │
└────────────┴──────┴───────────────┘

Where my days_in_a_row counter gets reset upon a date gap greater than 1 day.

What I've tried so far

df.with_columns(days_in_a_row=pl.cum_count("date").over("type"))

Which gives me

shape: (9, 3)
┌────────────┬──────┬───────────────┐
│ date       ┆ type ┆ days_in_a_row │
│ ---        ┆ ---  ┆ ---           │
│ date       ┆ str  ┆ u32           │
╞════════════╪══════╪═══════════════╡
│ 2024-08-01 ┆ A    ┆ 1             │
│ 2024-08-02 ┆ A    ┆ 2             │
│ 2024-08-03 ┆ A    ┆ 3             │
│ 2024-08-04 ┆ A    ┆ 4             │
│ 2024-08-04 ┆ B    ┆ 1             │
│ 2024-08-05 ┆ B    ┆ 2             │
│ 2024-08-06 ┆ B    ┆ 3             │
│ 2024-08-08 ┆ A    ┆ 5             │
│ 2024-08-09 ┆ A    ┆ 6             │
└────────────┴──────┴───────────────┘

Which is not resetting after the gap. I can't quite nail this one down.

I've also tried variations with

df
    .with_columns(date_gap=pl.col("date").diff().over("type"))
    .with_columns(days_in_a_row=(pl.cum_count("date").over("date_gap", "type")))

Which get's closer, but it still ends up not resetting where I'd want it

shape: (9, 4)
┌────────────┬──────┬──────────────┬───────────────┐
│ date       ┆ type ┆ date_gap     ┆ days_in_a_row │
│ ---        ┆ ---  ┆ ---          ┆ ---           │
│ date       ┆ str  ┆ duration[ms] ┆ u32           │
╞════════════╪══════╪══════════════╪═══════════════╡
│ 2024-08-01 ┆ A    ┆ null         ┆ 1             │
│ 2024-08-02 ┆ A    ┆ 1d           ┆ 1             │
│ 2024-08-03 ┆ A    ┆ 1d           ┆ 2             │
│ 2024-08-04 ┆ A    ┆ 1d           ┆ 3             │
│ 2024-08-04 ┆ B    ┆ null         ┆ 1             │
│ 2024-08-05 ┆ B    ┆ 1d           ┆ 1             │
│ 2024-08-06 ┆ B    ┆ 1d           ┆ 2             │
│ 2024-08-08 ┆ A    ┆ 4d           ┆ 1             │
│ 2024-08-09 ┆ A    ┆ 1d           ┆ 4             │
└────────────┴──────┴──────────────┴───────────────┘

Solution

  • What about first creating a secondary grouper based on the duration ≥ 1day?

    from datetime import timedelta
    
    (df.with_columns(group=pl.col("date").diff().gt(timedelta(days=1))
                             .fill_null(True).cum_sum().over("type"))
       .with_columns(days_in_a_row=pl.cum_count("date").over(["type", "group"]))
    )
    

    Output:

    ┌────────────┬──────┬───────┬───────────────┐
    │ date       ┆ type ┆ group ┆ days_in_a_row │
    │ ---        ┆ ---  ┆ ---   ┆ ---           │
    │ date       ┆ str  ┆ u32   ┆ u32           │
    ╞════════════╪══════╪═══════╪═══════════════╡
    │ 2024-08-01 ┆ A    ┆ 1     ┆ 1             │
    │ 2024-08-02 ┆ A    ┆ 1     ┆ 2             │
    │ 2024-08-03 ┆ A    ┆ 1     ┆ 3             │
    │ 2024-08-04 ┆ A    ┆ 1     ┆ 4             │
    │ 2024-08-04 ┆ B    ┆ 1     ┆ 1             │
    │ 2024-08-05 ┆ B    ┆ 1     ┆ 2             │
    │ 2024-08-06 ┆ B    ┆ 1     ┆ 3             │
    │ 2024-08-08 ┆ A    ┆ 2     ┆ 1             │
    │ 2024-08-09 ┆ A    ┆ 2     ┆ 2             │
    └────────────┴──────┴───────┴───────────────┘