I have the following df:
Start_Date End_Date Relevant Volume
2024-10-01 2024-12-31 False 0.000000
2025-01-01 2025-03-31 True 0.097989
2025-04-01 2025-06-30 True -0.014449
2025-01-01 2025-12-31 True 0.195327
2026-01-01 2026-12-31 False 0.000000
I need an hourly index starting/ending at the first/last dates, where Relevant == True. I do this as follows:
relevant_df = df[df['Relevant']]
earliest_start = relevant_df['Start_Date'].min()
latest_end = relevant_df['End_Date'].max()
# Create DateTime index
date_range = pd.date_range(start=earliest_start, end=latest_end, freq='H')
aggregated_volumes = pd.Series(index=date_range, dtype=float)
Now, how can I take the Volume per periode and add them together so that, in this ex, the first three months of 2025, the volume per hour equals 0.097989 + 0.195327 second quarter -0.014449 + 0.195327 etc..
Since your intervals are overlapping, I believe there is no direct way to index your values.
You can however create a NxM numpy array (N: number of rows, M: number of True rows) and sum:
# ensure datetime
relevant_df[['Start_Date', 'End_Date']] = relevant_df[['Start_Date', 'End_Date']].apply(pd.to_datetime)
# compute a mask of the values between start/end
idx = date_range.to_numpy()[:, None]
m1 = idx>=relevant_df['Start_Date'].to_numpy()
m2 = idx<relevant_df['End_Date'].to_numpy()
# broadcast the values, sum, convert to Series
out = pd.Series(np.nansum(np.where(m1&m2, relevant_df['Volume'].to_numpy(), np.nan), axis=1),
index=date_range)
Output:
2025-01-01 00:00:00 0.293316
2025-01-01 01:00:00 0.293316
2025-01-01 02:00:00 0.293316
2025-01-01 03:00:00 0.293316
2025-01-01 04:00:00 0.293316
...
2025-12-30 20:00:00 0.195327
2025-12-30 21:00:00 0.195327
2025-12-30 22:00:00 0.195327
2025-12-30 23:00:00 0.195327
2025-12-31 00:00:00 0.000000
Freq: h, Length: 8737, dtype: float64
Alternatively, if you have sparse intervals relative to the full range:
out = pd.DataFrame([pd.Series(v, index=pd.date_range(s, e, inclusive='left', freq='h'))
for s,e,v in zip(relevant_df['Start_Date'],
relevant_df['End_Date'],
relevant_df['Volume']
)]).sum()
Output:
2025-01-01 00:00:00 0.293316
2025-01-01 01:00:00 0.293316
2025-01-01 02:00:00 0.293316
2025-01-01 03:00:00 0.293316
2025-01-01 04:00:00 0.293316
...
2025-12-30 19:00:00 0.195327
2025-12-30 20:00:00 0.195327
2025-12-30 21:00:00 0.195327
2025-12-30 22:00:00 0.195327
2025-12-30 23:00:00 0.195327
Freq: h, Length: 8736, dtype: float64
Visual output: