pandaspython-polars

How can I sum cumulatively values on condition in polars or pandas?


I have a problem i'm trying to solve using preferably polars but pandas is also fine. Say we have the following dataset (sample):

import polars as pl

df = pl.from_repr("""
┌─────────────────────┬───────────┬───────────────────┐
│ date                ┆ customers ┆ is_reporting_day? │
│ ---                 ┆ ---       ┆ ---               │
│ datetime[ns]        ┆ i64       ┆ bool              │
╞═════════════════════╪═══════════╪═══════════════════╡
│ 2022-01-01 00:00:00 ┆ 3         ┆ true              │
│ 2022-01-02 00:00:00 ┆ 4         ┆ false             │
│ 2022-01-03 00:00:00 ┆ 5         ┆ false             │
│ 2022-01-04 00:00:00 ┆ 3         ┆ false             │
│ 2022-01-05 00:00:00 ┆ 2         ┆ true              │
└─────────────────────┴───────────┴───────────────────┘
""")

# df = df.to_pandas() # for a pandas solution

What i want to get is:

So after applying the transformation it should look like this:

┌─────────────────────┬───────────┬───────────────────┬─────────┐
│ date                ┆ customers ┆ is_reporting_day? ┆ cum_sum │
│ ---                 ┆ ---       ┆ ---               ┆ ---     │
│ datetime[ns]        ┆ i64       ┆ str               ┆ i64     │
╞═════════════════════╪═══════════╪═══════════════════╪═════════╡
│ 2022-01-01 00:00:00 ┆ 3         ┆ True              ┆ 3       │
│ 2022-01-05 00:00:00 ┆ 2         ┆ True              ┆ 14      │
└─────────────────────┴───────────┴───────────────────┴─────────┘

I've tried to use the cum_sum() function in polars by using the pl.when statement but it's the incorrect logic since it sums up from the beginning, i.e., the first day (there are about 700 days).

Note: the solution should be dynamic, i.e., sometimes the gap between a reporting_day and non-reporting_day is 1 day, 2 days, etc.

Any ideas or input is highly appreciated! Thanks in advance!

‎‎‎‎‎‎‎‎‎‎‎‎


Solution

  • One way to approach the problem is to create groups based on the is_reporting_day? column.

    If we keep the date value when is_reporting_day? is True:

    df.select(
       pl.when("is_reporting_day?").then(pl.col("date"))
    )
    
    shape: (5, 1)
    ┌────────────┐
    │ date       │
    │ ---        │
    │ date       │
    ╞════════════╡
    │ 2022-01-01 │
    │ null       │
    │ null       │
    │ null       │
    │ 2022-01-05 │
    └────────────┘
    

    We can .backward_fill() this to include the previous False rows as part of the group.

    date = pl.when("is_reporting_day?").then(pl.col("date"))
    
    (df.group_by(date.backward_fill(), maintain_order=True)
       .agg(
          pl.all().last(),
          pl.col("customers").sum().name.suffix("_sum")
       )
    )
    
    shape: (2, 4)
    ┌────────────┬───────────┬───────────────────┬───────────────┐
    │ date       ┆ customers ┆ is_reporting_day? ┆ customers_sum │
    │ ---        ┆ ---       ┆ ---               ┆ ---           │
    │ date       ┆ i64       ┆ bool              ┆ i64           │
    ╞════════════╪═══════════╪═══════════════════╪═══════════════╡
    │ 2022-01-01 ┆ 3         ┆ true              ┆ 3             │
    │ 2022-01-05 ┆ 2         ┆ true              ┆ 14            │
    └────────────┴───────────┴───────────────────┴───────────────┘