pythonpandasdataframemoving-average

Rolling average for forward values keeps breaking


dataset sample: https://pastebin.com/DeLhAsKL

if df['Actual_v_Predicted'] == "Predicted", it is meant to do a rolling mean average of the last 7 values and forward fill the ratio column, however, this keeps breaking after reaching 01/07/2025.

the way I've done this atm is

df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan,df['ratio'] )
df['ratio'] = df['ratio'].fillna(df['ratio'].rolling(window=7, min_periods=1).mean())

I'm not sure why its breaking, but any help would be appreciated.


Solution

  • There are several things you might need to tackle:

    1. Sorting date values
    df = df.sort_values(by='Date')
    
    1. Rolling Window Forward Fill

    The rolling mean is computed on previous values within a defined window. If all previous values are NaN, the rolling mean will also produce NaN. You may need to forward-fill after applying the rolling mean.

    So, instead of directly computing the rolling mean and assigning it back, you can try forward-filling explicitly for missing values after rolling mean:

    df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan, df['ratio'])
    
    # Compute rolling mean only on 'Actual' rows (already handled by np.where)
    rolling_mean = df['ratio'].rolling(window=7, min_periods=1).mean()
    
    # Fill in 'Predicted' rows using rolling mean and then forward fill if needed
    df['ratio'] = df['ratio'].fillna(rolling_mean).ffill()
    
    1. Might need to date format
    df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
    
    1. Check the edge cases

    After 01/07/2025, if there are no "Actual" values to base the rolling mean on, you'll get NaN. In such cases, you may need to handle long gaps or future dates more carefully. One option is to always fill missing values forward:

    df['ratio'] = df['ratio'].ffill()
    

    I have tried the following code with your dataset and it executed with no issues:

    import pandas as pd
    import numpy as np
    
    df = pd.read_csv('dataset.txt', sep='\t')
    df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
    
    df = df.sort_values(by='Date')
    df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan, df['ratio'])
    
    rolling_mean = df['ratio'].rolling(window=7, min_periods=1).mean()
    df['ratio'] = df['ratio'].fillna(rolling_mean).ffill()
    
    df.to_csv('updated_file.txt', sep='\t', index=False)
    
    print(df)