python statsmodels autoregressive-models

Statsmodels AutoRegression backtesting code validity

I am learning about autoregressive models in Python using the statsmodels library.

What I am doing is taking a dataset which shows the returns for a financial stock from here:

The data looks as follows when I run the following command:

data['y'] = data['close'].shift(-1).astype(float)
data.dropna(inplace=True)
data.tail()

Basically the y column is the close for the next time step. I have 500 time steps altogether in my dataset.
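
As a small illustration of what that shift does, here is a sketch on a toy frame. The four `close` values are made up; only the `shift(-1)` and `dropna` calls come from the command above.

```python
import pandas as pd

# toy prices, made up for illustration only
data = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0]})

# y on each row is the close of the following row
data['y'] = data['close'].shift(-1).astype(float)

# the last row has no following close, so its y is NaN and dropna removes it
data.dropna(inplace=True)
print(data)
```

Each remaining row pairs the current close with the next one, and the frame is one row shorter than the raw data.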

Now I want to fit a simple AR model that predicts the next period's close from the current close. To test the fit of my model, I ran a backtest.

import matplotlib.pyplot as plt
from statsmodels.tsa.ar_model import AutoReg

def backtest(num_periods, data):
    predictions = []
    true_values = []
    x = data[['open', 'high', 'low', 'close']]
    y = data['y']
    for i in reversed(range(1, num_periods)):
        # split the data into training and test splits
        # the y_test variable should be a single value for the next period out of the sample
        x_train = x.iloc[:len(x)-i]
        y_train = y.iloc[:len(y)-i]
        x_test = x.iloc[len(x)-i]
        y_test = y.iloc[len(y)-i]
        # fit the model on the endogenous variables
        model = AutoReg(endog=x_train.close.astype(float), lags = 13).fit()
        # forecast for the period out of the sample
        pred = model.predict(start=len(x_train), end=len(x_train)+1)
        # create the prediction and true value arrays
        predictions.append(pred)
        true_values.append(y_test)
    return true_values, predictions

true, pred = backtest(10, data)

But this gives me two series for prediction:

plt.plot(true, label='true', );
plt.plot(pred, label = 'pred', );
plt.legend();

What is going on here? And is my method for backtesting the AR model correct? My main worry is it seems I am training the model on y, but y is from the next period. So when I predict out of sample, it is taking in a value from the test set.

Any guidance, ideally with code examples, would be much appreciated.

Solution

There are several things to address here. First, an autoregressive model is a poor choice for forecasting a stock or any other highly liquid market. An autoregressive model predicts the next value from the previous n lags, so if the series has no cyclical or otherwise persistent structure in those lags, the model has little to work with. In more detail:

Consider the efficient market hypothesis, which states that the price of an asset already reflects all known information. For forecasting, this means that if everyone knew the price went up on Tuesdays and dipped on Fridays, people would buy on Mondays and sell on Thursdays, moving the price back to equilibrium and erasing the pattern.

If your aim is to forecast the market accurately, there have been some attempts using LSTM neural network models that have reported some degree of success.

Regarding the validation method, the expanding-window approach itself seems OK. A more flexible alternative is walk-forward validation, where each iteration fits the model and records a validation metric directly. You only pass in the dataset and a test size, which is the number of rows (in this case, days) held out in each iteration:

from datetime import timedelta

import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error
from statsmodels.tsa.ar_model import AutoReg


def test_train_spl(data, testsize):
    # hold out the last `testsize` rows as the test set; everything before is training data
    test = data.tail(testsize)
    train = data.head(data.shape[0] - testsize)
    return test, train


def walkforward_validation(data, test_start_date, test_end_date=None, step_size=15, testsize=15):
    # work on a copy with a datetime index so the caller's frame is left untouched
    data = data.copy()
    data.index = pd.to_datetime(data.index)

    current_max_date = pd.to_datetime(test_start_date)
    if test_end_date is None:
        test_end_date = data.index.max()
    else:
        test_end_date = pd.to_datetime(test_end_date)

    results = []
    while current_max_date < test_end_date:
        # keep everything up to the end of the current test window, then split off the last rows
        iter_data = data[data.index <= current_max_date + timedelta(days=testsize)]
        test, train = test_train_spl(iter_data, testsize=testsize)

        # fit the model on the training closes only (positional index, so forecasts are addressed by row number)
        y_train = train['close'].astype(float).reset_index(drop=True)
        y_test = test['close'].astype(float)
        fitted = AutoReg(endog=y_train, lags=13).fit()
        # forecast one value per test row; `end` is inclusive
        pred = fitted.predict(start=len(y_train), end=len(y_train) + len(y_test) - 1)

        results.append({
            'test_start': current_max_date,
            'test_end': current_max_date + timedelta(days=testsize),
            'MAE': mean_absolute_error(y_test, pred),
            'MAPE': mean_absolute_percentage_error(y_test, pred),
        })

        # add the step size to the current date analyzed and continue until the end date is reached
        current_max_date = current_max_date + timedelta(days=step_size)

    return pd.DataFrame(results, columns=['test_start', 'test_end', 'MAE', 'MAPE'])