dataset sample: https://pastebin.com/DeLhAsKL
if df['Actual_v_Predicted'] == "Predicted"
, it is meant to do a rolling mean average of the last 7 values and forward fill the ratio column, however, this keeps breaking after reaching 01/07/2025.
the way I've done this atm
is
df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan,df['ratio'] )
df['ratio'] = df['ratio'].fillna(df['ratio'].rolling(window=7, min_periods=1).mean())
I'm not sure why its breaking, but any help would be appreciated.
There are several things you might need to tackle:
df = df.sort_values(by='Date')
The rolling mean is computed on previous values within a defined window. If all previous values are NaN
, the rolling mean will also produce NaN
. You may need to forward-fill after applying the rolling mean.
So, instead of directly computing the rolling mean and assigning it back, you can try forward-filling explicitly for missing values after rolling mean:
df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan, df['ratio'])
# Compute rolling mean only on 'Actual' rows (already handled by np.where)
rolling_mean = df['ratio'].rolling(window=7, min_periods=1).mean()
# Fill in 'Predicted' rows using rolling mean and then forward fill if needed
df['ratio'] = df['ratio'].fillna(rolling_mean).ffill()
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
After 01/07/2025, if there are no "Actual" values to base the rolling mean on, you'll get NaN
. In such cases, you may need to handle long gaps or future dates more carefully. One option is to always fill missing values forward:
df['ratio'] = df['ratio'].ffill()
I have tried the following code with your dataset and it executed with no issues:
import pandas as pd
import numpy as np
df = pd.read_csv('dataset.txt', sep='\t')
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df = df.sort_values(by='Date')
df['ratio'] = np.where(df['Actual_v_Predicted'] == "Predicted", np.nan, df['ratio'])
rolling_mean = df['ratio'].rolling(window=7, min_periods=1).mean()
df['ratio'] = df['ratio'].fillna(rolling_mean).ffill()
df.to_csv('updated_file.txt', sep='\t', index=False)
print(df)