I have observations with start and end date of the following format:
import pandas as pd
data = pd.DataFrame({
    'start_date': pd.to_datetime(['2021-01-07', '2021-01-04', '2021-01-12', '2021-01-03']),
    'end_date': pd.to_datetime(['2021-01-16', '2021-01-12', '2021-01-13', '2021-01-15']),
    'value': [7, 6, 5, 4],
})
data
start_date end_date value
0 2021-01-07 2021-01-16 7
1 2021-01-04 2021-01-12 6
2 2021-01-12 2021-01-13 5
3 2021-01-03 2021-01-15 4
The date ranges between observations overlap. I would like to compute the daily sum aggregated across all observations.
My version with a loop (below) is slow and crashes for ~100k observations. What would be a way to speed things up?
def turn_data_into_date_range(row):
    dates = pd.date_range(start=row.start_date, end=row.end_date)
    return pd.Series(data=row.value, index=dates)

out = []
for index, row in data.iterrows():
    out.append(turn_data_into_date_range(row))
result = pd.concat(out, axis=1).sum(axis=1)
result
2021-01-03 4.0
2021-01-04 10.0
2021-01-05 10.0
2021-01-06 10.0
2021-01-07 17.0
2021-01-08 17.0
2021-01-09 17.0
2021-01-10 17.0
2021-01-11 17.0
2021-01-12 22.0
2021-01-13 16.0
2021-01-14 11.0
2021-01-15 11.0
2021-01-16 7.0
Freq: D, dtype: float64
PS: the answer to this related question doesn't work in my case, as they have non-overlapping observations and can use a left join: Convert Date Ranges to Time Series in Pandas
This problem comes up regularly, and it isn't an easy one: the obvious techniques expand each row into a date range or otherwise iterate over rows. In this case there's a smarter workaround, which is to use cumulative sums and then reindex.
>>> starts = data.set_index('start_date')['value'].sort_index().cumsum()
>>> starts
start_date
2021-01-03 4
2021-01-04 10
2021-01-07 17
2021-01-12 22
Name: value, dtype: int64
>>> ends = data.set_index('end_date')['value'].sort_index().cumsum()
>>> ends
end_date
2021-01-12 6
2021-01-13 11
2021-01-15 15
2021-01-16 22
Name: value, dtype: int64
In case your dates are not unique, you can group by date and sum first. Then the series definitions are as follows:
>>> starts = data.groupby('start_date')['value'].sum().sort_index().cumsum()
>>> ends = data.groupby('end_date')['value'].sum().sort_index().cumsum()
Note that here we no longer need set_index(): groupby(...).sum() already produces a date index, since it is an aggregation, unlike .cumsum(), which is a transform operation.
Of course, if the ends are inclusive (the value still counts on the end date itself), you might need to add a .shift() so that each end is only subtracted starting the day after:
>>> dates = pd.date_range(starts.index.min(), ends.index.max())
>>> ends.reindex(dates).ffill().shift().fillna(0)
2021-01-03 0.0
2021-01-04 0.0
2021-01-05 0.0
2021-01-06 0.0
2021-01-07 0.0
2021-01-08 0.0
2021-01-09 0.0
2021-01-10 0.0
2021-01-11 0.0
2021-01-12 0.0
2021-01-13 6.0
2021-01-14 11.0
2021-01-15 11.0
2021-01-16 15.0
Freq: D, Name: value, dtype: float64
Then just subtract the (possibly shifted) ends from the starts:
>>> starts.reindex(dates).ffill() - ends.reindex(dates).ffill().shift().fillna(0)
2021-01-03 4.0
2021-01-04 10.0
2021-01-05 10.0
2021-01-06 10.0
2021-01-07 17.0
2021-01-08 17.0
2021-01-09 17.0
2021-01-10 17.0
2021-01-11 17.0
2021-01-12 22.0
2021-01-13 16.0
2021-01-14 11.0
2021-01-15 11.0
2021-01-16 7.0
Freq: D, Name: value, dtype: float64
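Putting it all together, here is a minimal sketch of the whole recipe as a reusable function. The name daily_sum and the inclusive_ends flag are my own, not from pandas; it assumes the start_date, end_date and value column names from the question:

import pandas as pd

def daily_sum(data, inclusive_ends=True):
    # Running total of value that has started by each date
    # (groupby + sum handles duplicate dates and sets the date index).
    starts = data.groupby('start_date')['value'].sum().sort_index().cumsum()
    # Running total of value that has ended by each date.
    ends = data.groupby('end_date')['value'].sum().sort_index().cumsum()

    dates = pd.date_range(starts.index.min(), ends.index.max())
    starts = starts.reindex(dates).ffill()
    ends = ends.reindex(dates).ffill()
    if inclusive_ends:
        # A range ending on day D still counts on D, so drop it only from D+1.
        ends = ends.shift()
    return starts - ends.fillna(0)

Calling daily_sum(data) on the question's DataFrame should reproduce the daily series shown above.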