pythonpandasdataframepandas-groupbypandas-merge

Aggregating Values using Date Ranges in Another Dataframe


I need to sum all values from maindata using master_records. Many values for ids will not get summed even if there are timestamps and values for these columns.

import pandas as pd
 
#Proxy reference dataframe
master_records = [['site a', '2021-03-05 02:00:00', '2021-03-05 03:00:00'], 
        ['site a', '2021-03-05 06:00:00', '2021-03-05 08:00:00'], 
        ['site b', '2021-04-08 10:00:00', '2021-04-08 13:00:00']] 
 
mst_df = pd.DataFrame(master_records, columns = ['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)


#Proxy main high frequency dataframe
main_data = [['id a','2021-03-05 00:00:00', 10], #not aggregated
        ['id a','2021-03-05 01:00:00', 19], #not aggregated
        ['id a','2021-03-05 02:00:00', 9], 
        ['id a','2021-03-05 03:00:00', 16],
        ['id a','2021-03-05 04:00:00', 16], #not aggregated
        ['id a','2021-03-05 05:00:00', 11], #not aggregated
        ['id a','2021-03-05 06:00:00', 16], 
        ['id a','2021-03-05 07:00:00', 12], 
        ['id a','2021-03-05 08:00:00', 9], 
        ['id b','2021-04-08 10:00:00', 11], 
        ['id b','2021-04-08 11:00:00', 10], 
        ['id b','2021-04-08 12:00:00', 19], 
        ['id b','2021-04-08 13:00:00', 10], 
        ['id b','2021-04-08 14:00:00', 16]] #not aggregated
 
# Create the pandas DataFrame
maindata = pd.DataFrame(main_data, columns = ['id', 'timestamp', 'value'])
maindata['timestamp'] = pd.to_datetime(maindata['timestamp'], infer_datetime_format=True)

The desired DataFrame looks like:

print(mst_df)
id  start   end                                      sum(value)
0   site a  2021-03-05 02:00:00 2021-03-05 03:00:00  25
1   site a  2021-03-05 06:00:00 2021-03-05 08:00:00  37
2   site b  2021-04-08 10:00:00 2021-04-08 13:00:00  50

Solution

  • The "id"s don't match; so first we create a column in both DataFrames to get a matching ID; then merge on the matching "id"s; then filter the merged DataFrame on the rows where the timestamps are between "start" and "end". Finally groupby + sum will fetch the desired outcome:

    maindata['id_letter'] = maindata['id'].str.split().str[-1]
    mst_df['id_letter'] = mst_df['id'].str.split().str[-1]
    merged = mst_df.merge(maindata, on='id_letter', suffixes=('','_'))
    out = (merged[merged['timestamp'].between(merged['start'], merged['end'])]
           .groupby(['id','start','end'], as_index=False)['value'].sum())
    

    Output:

           id               start                 end  value
    0  site a 2021-03-05 02:00:00 2021-03-05 03:00:00     25
    1  site a 2021-03-05 06:00:00 2021-03-05 08:00:00     37
    2  site b 2021-04-08 10:00:00 2021-04-08 13:00:00     50