pythonpandasdataframegroup-bypandas-rolling

How to apply a rolling window mean with group by filter in Pandas


I have the following data frame:

product_id year sales freq
1 2023 100 2
1 2022 200 3
2 2023 10 2
3 2022 600 3

I want to add another column that calculates and adds mean values to the data frame based on the following formula.

if df["freq"] = x: df["mean_sales"] = the mean of 'x' rows below and 'x' rows above the current row where the product id is the same.

I want to sort the table first by product ID and year. Then group it by product ID and then calculate the mean based on the formula above.

Here is my pseudocode that is currently not working as expected:

import pandas as pd

data = {
    "product_id": [1, 1, 2, 3],
    "year": [2023, 2022, 2023, 2022],
    "sales": [100, 200, 10, 600],
    "freq": [2, 3, 2, 3],
}

df = pd.DataFrame(data)
unique_frequencies = df["freq"].unique().tolist()

for frequency in unique_frequencies:
  df_tmp = df[[df["freq"] == frequency]].sort_values(by=["product_id","year"], ascending=True).groupby("product_id")
  df_tmp["mean_sales"] = df_tmp["sales"].rolling(window_size=frequency, closed="left").mean()

  df_results = pd.concat([df_results, df_tmp]) 

I get the following error while trying this operation: TypeError: 'DataFrameGroupBy' object does not support item assignment.

I would really appreciate if someone can help me solve this. Alternate methods are also welcome.


Solution

  • I have fixed your code and added some more data to the example so it could be better tested. The main problem with your code was applying the groupby method, it needs to be used with another operation and it is more readable if applied in another line.

    One-liner solutions for complex calculations are not always a best practice, should only be used if there is a efficiency justification.

    import pandas as pd
    
    data = {
        "product_id": [1, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
        "year": [2023, 2022, 2023, 2022, 2022, 2023, 2022, 2022, 2023, 2022, 2022, 2023, 2022],
        "sales": [100, 200, 10, 600, 200, 10, 600, 200, 10, 600, 200, 10, 600],
        "freq": [2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 3, 2, 3],
    }
    
    df = pd.DataFrame(data)
    df = df.sort_values(by=["product_id", "year"], ascending=True)
    unique_frequencies = df["freq"].unique().tolist()
    
    df_tmp_list = []
    for frequency in unique_frequencies:
        df_tmp = df[df["freq"] == frequency].copy()
    
        df_tmp["mean_sales"] = df_tmp.groupby("product_id")["sales"].rolling(frequency).sum().reset_index(drop=True)
    
        df_tmp_list.append(df_tmp) 
      
    df_results = pd.concat(df_tmp_list, ignore_index=True)