pythonpandasgroup-bypandas-rolling

Get rolling average without every timestamp


I have data about how many messages each account sends aggregated to an hourly level. For each row, I would like to add a column with the sum of the previous 7 days messages. I know I can groupby account and date and aggregate the number of messages to the daily level, but I'm having a hard time calculating the rolling average because there isn't a row in the data if the account didn't send any messages that day (and I'd like to not balloon my data by adding these in if at all possible). If I could figure out a way to calculate the rolling 7-day average for each day that each account sent messages, I could then re-join that number back to the hourly data (is my hope). Any suggestions?

Note: For any day not in the data, assume 0 messages sent.

Raw Data:

Account | Messages | Date       | Hour
12        5          2022-07-11   09:00:00
12        6          2022-07-13   10:00:00
12        10         2022-07-13   11:00:00
12        9          2022-07-15   16:00:00
12        1          2022-07-19   13:00:00
15        2          2022-07-12   10:00:00
15        13         2022-07-13   11:00:00
15        3          2022-07-17   16:00:00
15        4          2022-07-22   13:00:00

Desired Output:

Account | Messages | Date       | Hour      | Rolling Previous 7 Day Average
12        5          2022-07-11   09:00:00    0
12        6          2022-07-13   10:00:00    0.714
12        10         2022-07-13   11:00:00    0.714
12        9          2022-07-15   16:00:00    3
12        1          2022-07-19   13:00:00    3.571
15        2          2022-07-12   10:00:00    0
15        13         2022-07-13   11:00:00    0.286
15        3          2022-07-17   16:00:00    2.143
15        4          2022-07-22   13:00:00    0.429

Solution

  • I hope I've understood your question right:

    df["Date"] = pd.to_datetime(df["Date"])
    df["Messages_tmp"] = df.groupby(["Account", "Date"])["Messages"].transform(
        "sum"
    )
    
    df["Rolling Previous 7 Day Average"] = (
        df.set_index("Date")
        .groupby("Account")["Messages_tmp"]
        .rolling("7D")
        .apply(lambda x: x.loc[~x.index.duplicated()].shift().sum() / 7)
    ).values
    
    df = df.drop(columns="Messages_tmp")
    
    print(df)
    

    Prints:

       Account  Messages       Date      Hour  Rolling Previous 7 Day Average
    0       12         5 2022-07-11  09:00:00                        0.000000
    1       12         6 2022-07-13  10:00:00                        0.714286
    2       12        10 2022-07-13  11:00:00                        0.714286
    3       12         9 2022-07-15  16:00:00                        3.000000
    4       12         1 2022-07-19  13:00:00                        3.571429
    5       15         2 2022-07-12  10:00:00                        0.000000
    6       15        13 2022-07-13  11:00:00                        0.285714
    7       15         3 2022-07-17  16:00:00                        2.142857
    8       15         4 2022-07-22  13:00:00                        0.428571