New to ARIMA and attempting to model a dataset in Python using auto ARIMA. I'm using auto-ARIMA as I believe it will be better at defining the values of p, d and q however the results are poor and I need some guidance. Please see my reproducible attempts below
Attempt as follows:
# DEPENDENCIES
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pmdarima as pm
from pmdarima.model_selection import train_test_split
from statsmodels.tsa.stattools import adfuller
from pmdarima.arima import ADFTest
from pmdarima import auto_arima
from sklearn.metrics import r2_score
# CREATE DATA
data_plot = pd.DataFrame(data removed)
# SET INDEX
data_plot['date_index'] = pd.to_datetime(data_plot['date']
data_plot.set_index('date_index', inplace=True)
# CREATE ARIMA DATASET
arima_data = data_plot[['value']]
arima_data
# PLOT DATA
arima_data['value'].plot(figsize=(7,4))
The above steps result in a dataset that should look like this.
# Dicky Fuller test for stationarity
adf_test = ADFTest(alpha = 0.05)
adf_test.should_diff(arima_data)
Result = 0.9867 indicating non-stationary data which should be handled by appropriate over of differencing later in auto arima process.
# Assign training and test subsets - 80:20 split
print('Dataset dimensions;', arima_data.shape)
train_data = arima_data[:-24]
test_data = arima_data[-24:]
print('Training data dimension:', train_data.shape, round((len(train_data)/len(arima_data)*100),2),'% of dataset')
print('Test data dimension:', test_data.shape, round((len(train_data)/len(arima_data)*100),2),'% of dataset')
# Plot training & test data
plt.plot(train_data)
plt.plot(test_data)
# Run auto arima
arima_model = auto_arima(train_data, start_p=0, d=1, start_q=0,
max_p=5, max_d=5, max_q=5,
start_P=0, D=1, start_Q=0, max_P=5, max_D=5,
max_Q=5, m=12, seasonal=True,
stationary=False,
error_action='warn', trace=True,
suppress_warnings=True, stepwise=True,
random_state=20, n_fits=50)
print(arima_model.aic())
Output suggests best model is 'ARIMA(1,1,1)(0,1,0)[12]'
with AIC 1725.35484
#Store predicted values and view resultant df
prediction = pd.DataFrame(arima_model.predict(n_periods=25), index=test_data.index)
prediction.columns = ['predicted_value']
prediction
# Plot prediction against test and training trends
plt.figure(figsize=(7,4))
plt.plot(train_data, label="Training")
plt.plot(test_data, label="Test")
plt.plot(prediction, label="Predicted")
plt.legend(loc='upper right')
plt.show()
# Finding r2 model score
test_data['predicted_value'] = prediction
r2_score(test_data['value'], test_data['predicted_value'])
Result: -6.985
Is auto_arima
a method done by you? It depends how you differentiate and what you do there. Did you check the autocorrelation and partial autocorrelation to know which repeating time lags you have there?
Also, it seems you have some seasonality patterns every year, you could try a SARIMA model if you are not doing it already.
To try a SARIMA model you have to:
data_stationarized = train_data.diff()[1:]
statsmodels
for this.import statsmodels.api as sm
sm.graphics.tsa.plot_acf(data_stationarized);
You can see that the most prominent flag is the twelfth flag, so as the granularity of the data is by month, that means there is prominent seasonality pattern every 12 months.
We can check the partial autocorrelation to confirm it too:
sm.graphics.tsa.plot_pacf(data_stationarized);
Again the most prominent flag is the twelfth one.
model = sm.tsa.SARIMAX(endog=train_data, order=(2,0,0), seasonal_order=(2,0,0,12))
model_fit = model.fit()
from sklearn.metrics import mean_squared_error
y_pred = model_fit.forecast(steps=24)
# when squared=False then is equals to RMSE
mean_squared_error(y_true=test_data.values, y_pred=y_pred, squared=False)
This outputs 12063.88
, which you can use to compare different results more rigorously.
For a graphical check:
prediction = pd.DataFrame(model_fit.forecast(steps=25), index=test_data.index)
prediction.columns = ['predicted_value']
prediction
# Plot prediction against test and training trends
plt.figure(figsize=(7,4))
plt.plot(train_data, label="Training")
plt.plot(test_data, label="Test")
plt.plot(prediction, label="Predicted")
plt.legend(loc='upper right')
plt.xticks([])
plt.yticks([])
plt.show();
Now you can see that the predictions get closer to the expected values.
You could continue fine tuning the order and seasonal order to get even better results, I will advice to check the docs of statsmodel.
Another advice it's to analyze the autocorrelation and partial autocorrelation of the residuals to check if your model is capturing all of the patterns. You have them in the model_fit
object.