I have a problem i'm trying to solve using preferably polars but pandas is also fine. Say we have the following dataset (sample):
import polars as pl
df = pl.from_repr("""
┌─────────────────────┬───────────┬───────────────────┐
│ date ┆ customers ┆ is_reporting_day? │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ i64 ┆ bool │
╞═════════════════════╪═══════════╪═══════════════════╡
│ 2022-01-01 00:00:00 ┆ 3 ┆ true │
│ 2022-01-02 00:00:00 ┆ 4 ┆ false │
│ 2022-01-03 00:00:00 ┆ 5 ┆ false │
│ 2022-01-04 00:00:00 ┆ 3 ┆ false │
│ 2022-01-05 00:00:00 ┆ 2 ┆ true │
└─────────────────────┴───────────┴───────────────────┘
""")
# df = df.to_pandas() # for a pandas solution
What i want to get is:
True
keep the customers number as isFalse
: sum up all the customers (4, 5, 3 = 12 + 2 = 14) and add it to the next True
value in reporting_daySo after applying the transformation it should look like this:
┌─────────────────────┬───────────┬───────────────────┬─────────┐
│ date ┆ customers ┆ is_reporting_day? ┆ cum_sum │
│ --- ┆ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ i64 ┆ str ┆ i64 │
╞═════════════════════╪═══════════╪═══════════════════╪═════════╡
│ 2022-01-01 00:00:00 ┆ 3 ┆ True ┆ 3 │
│ 2022-01-05 00:00:00 ┆ 2 ┆ True ┆ 14 │
└─────────────────────┴───────────┴───────────────────┴─────────┘
I've tried to use the cum_sum()
function in polars by using the pl.when
statement but it's the incorrect logic since it sums up from the beginning, i.e., the first day (there are about 700 days).
Note: the solution should be dynamic, i.e., sometimes the gap between a reporting_day and non-reporting_day is 1 day, 2 days, etc.
Any ideas or input is highly appreciated! Thanks in advance!
One way to approach the problem is to create groups based on the is_reporting_day?
column.
If we keep the date
value when is_reporting_day?
is True:
df.select(
pl.when("is_reporting_day?").then(pl.col("date"))
)
shape: (5, 1)
┌────────────┐
│ date │
│ --- │
│ date │
╞════════════╡
│ 2022-01-01 │
│ null │
│ null │
│ null │
│ 2022-01-05 │
└────────────┘
We can .backward_fill()
this to include the previous False rows as part of the group.
date = pl.when("is_reporting_day?").then(pl.col("date"))
(df.group_by(date.backward_fill(), maintain_order=True)
.agg(
pl.all().last(),
pl.col("customers").sum().name.suffix("_sum")
)
)
shape: (2, 4)
┌────────────┬───────────┬───────────────────┬───────────────┐
│ date ┆ customers ┆ is_reporting_day? ┆ customers_sum │
│ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ bool ┆ i64 │
╞════════════╪═══════════╪═══════════════════╪═══════════════╡
│ 2022-01-01 ┆ 3 ┆ true ┆ 3 │
│ 2022-01-05 ┆ 2 ┆ true ┆ 14 │
└────────────┴───────────┴───────────────────┴───────────────┘