I have a DataFrame (`df`) that contains the columns `ID`, `Initial Date`, `Final Date`, and `Value`, and another DataFrame (`dates`) that contains all the days for each ID from `df`.

On the `dates` dataframe I want to sum the values if the date exists within the range of each ID.

Here is my code:
import polars as pl
from datetime import datetime

data = {
    "ID": [1, 2, 3, 4, 5],
    "Initial Date": ["2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05"],
    "Final Date": ["2022-01-03", "2022-01-06", "2022-01-07", "2022-01-09", "2022-01-07"],
    "Value": [10, 20, 30, 40, 50],
}
df = pl.DataFrame(data)

dates = pl.datetime_range(
    start=datetime(2022, 1, 1),
    end=datetime(2022, 1, 7),
    interval="1d",
    eager=True,
    closed="both",
).to_frame("date")
shape: (5, 4)
┌─────┬──────────────┬────────────┬───────┐
│ ID ┆ Initial Date ┆ Final Date ┆ Value │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ i64 │
╞═════╪══════════════╪════════════╪═══════╡
│ 1 ┆ 2022-01-01 ┆ 2022-01-03 ┆ 10 │
│ 2 ┆ 2022-01-02 ┆ 2022-01-06 ┆ 20 │
│ 3 ┆ 2022-01-03 ┆ 2022-01-07 ┆ 30 │
│ 4 ┆ 2022-01-04 ┆ 2022-01-09 ┆ 40 │
│ 5 ┆ 2022-01-05 ┆ 2022-01-07 ┆ 50 │
└─────┴──────────────┴────────────┴───────┘
shape: (7, 1)
┌─────────────────────┐
│ date │
│ --- │
│ datetime[μs] │
╞═════════════════════╡
│ 2022-01-01 00:00:00 │
│ 2022-01-02 00:00:00 │
│ 2022-01-03 00:00:00 │
│ 2022-01-04 00:00:00 │
│ 2022-01-05 00:00:00 │
│ 2022-01-06 00:00:00 │
│ 2022-01-07 00:00:00 │
└─────────────────────┘
In this case, on 2022-01-01 the value would be 10. On 2022-01-02 it would be 10 + 20 = 30, on 2022-01-03 it would be 10 + 20 + 30 = 60, and so on. In other words, I want to check whether each date falls within the range of each row in the DataFrame (`df`), and if it does, sum the values.
I think the approach for this is something like this:

(
    dates.with_columns(
        pl.sum(
            pl.when(
                (df["Initial Date"] <= pl.col("date")) & (df["Final Date"] >= pl.col("date"))
            )
            .then(df["Value"])
            .otherwise(0)
        ).alias("Summed Value")
    )
)
Update: `join_where()` was added in Polars 1.7.0, which supports this kind of inequality join directly:
(
    dates.join_where(
        # the date columns in df are strings, so parse them first
        df.with_columns(pl.col("Initial Date", "Final Date").str.to_datetime()),
        pl.col("date") >= pl.col("Initial Date"),
        pl.col("date") <= pl.col("Final Date"),
    )
    .group_by("date")
    .agg(pl.col("Value").sum())
    .sort("date")
)
┌─────────────────────┬───────┐
│ date ┆ Value │
│ --- ┆ --- │
│ datetime[μs] ┆ i64 │
╞═════════════════════╪═══════╡
│ 2022-01-01 00:00:00 ┆ 10 │
│ 2022-01-02 00:00:00 ┆ 30 │
│ 2022-01-03 00:00:00 ┆ 60 │
│ 2022-01-04 00:00:00 ┆ 90 │
│ 2022-01-05 00:00:00 ┆ 140 │
│ 2022-01-06 00:00:00 ┆ 140 │
│ 2022-01-07 00:00:00 ┆ 120 │
└─────────────────────┴───────┘
Previous answer:
If you just want to know the sum of values on each date within the ranges in `df`, you don't even need the `dates` dataframe. Use:

- `date_ranges()` to create a column with date ranges based on the initial and final dates;
- `explode()` to convert the date ranges into rows;
- `group_by()` and `agg()` to sum the values.

(
    df
    # the date columns are strings, so parse them to dates first
    .with_columns(pl.col("Initial Date", "Final Date").str.to_date())
    .with_columns(date = pl.date_ranges("Initial Date", "Final Date"))
    .explode("date")
    .group_by("date", maintain_order = True)
    .agg(pl.col.Value.sum())
)
┌────────────┬───────┐
│ date ┆ Value │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 10 │
│ 2022-01-02 ┆ 30 │
│ 2022-01-03 ┆ 60 │
│ 2022-01-04 ┆ 90 │
│ 2022-01-05 ┆ 140 │
│ 2022-01-06 ┆ 140 │
│ 2022-01-07 ┆ 120 │
│ 2022-01-08 ┆ 40 │
│ 2022-01-09 ┆ 40 │
└────────────┴───────┘
If you really want to use `dates`, then you can `join()` the result on the `dates` dataframe:

(
    df
    .with_columns(pl.col("Initial Date", "Final Date").str.to_date())
    .with_columns(date = pl.date_ranges("Initial Date", "Final Date"))
    .explode("date")
    .group_by("date", maintain_order = True)
    .agg(pl.col.Value.sum())
    # dates holds datetimes while the exploded column holds dates,
    # so align the dtypes before the semi-join
    .join(dates.with_columns(pl.col("date").dt.date()), on="date", how="semi")
)
┌────────────┬───────┐
│ date ┆ Value │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 10 │
│ 2022-01-02 ┆ 30 │
│ 2022-01-03 ┆ 60 │
│ 2022-01-04 ┆ 90 │
│ 2022-01-05 ┆ 140 │
│ 2022-01-06 ┆ 140 │
│ 2022-01-07 ┆ 120 │
└────────────┴───────┘
Or just `filter()` the result:

(
    df
    .with_columns(pl.col("Initial Date", "Final Date").str.to_date())
    .with_columns(date = pl.date_ranges("Initial Date", "Final Date"))
    .explode("date")
    .group_by("date", maintain_order = True)
    .agg(pl.col.Value.sum())
    .filter(pl.col.date.is_between(datetime(2022, 1, 1), datetime(2022, 1, 7)))
)
┌────────────┬───────┐
│ date ┆ Value │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2022-01-01 ┆ 10 │
│ 2022-01-02 ┆ 30 │
│ 2022-01-03 ┆ 60 │
│ 2022-01-04 ┆ 90 │
│ 2022-01-05 ┆ 140 │
│ 2022-01-06 ┆ 140 │
│ 2022-01-07 ┆ 120 │
└────────────┴───────┘
An alternative solution would be to use a join on inequality, but Polars is not great at it (yet). In this case you can use the DuckDB integration with Polars:
import duckdb

duckdb.sql("""
    select
        d.date,
        sum(df.value) as value
    from df
    inner join dates as d
        on d.date between df."Initial Date" and df."Final Date"
    group by
        d.date
    order by
        d.date
""").pl()
┌─────────────────────┬───────────────┐
│ date ┆ value │
│ --- ┆ --- │
│ datetime[μs] ┆ decimal[38,0] │
╞═════════════════════╪═══════════════╡
│ 2022-01-01 00:00:00 ┆ 10 │
│ 2022-01-02 00:00:00 ┆ 30 │
│ 2022-01-03 00:00:00 ┆ 60 │
│ 2022-01-04 00:00:00 ┆ 90 │
│ 2022-01-05 00:00:00 ┆ 140 │
│ 2022-01-06 00:00:00 ┆ 140 │
│ 2022-01-07 00:00:00 ┆ 120 │
└─────────────────────┴───────────────┘