python time-series statsmodels autoregressive-models

Correctly interpreting the array returned by statsmodels.tsa.ar_model.ar_select_order to determine the optimal lag

Using statsmodels 0.12.0, I am attempting to determine the optimal lag for a statsmodels.tsa.ar_model.AutoReg model. I am using US population data with monthly timesteps and passing a maximum lag of 12 to the statsmodels.tsa.ar_model.ar_select_order function.

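import pandas as pd
from statsmodels.tsa.ar_model import AutoReg, ar_select_order

# Raw string so that the backslash in the Windows path is not parsed as a \u escape
df = pd.read_csv(r'Data\uspopulation.csv', index_col='DATE', parse_dates=True)
df.index.freq = 'MS'  # monthly data, month-start frequency
train_data = df.iloc[:84]
test_data = df.iloc[84:]
modelp = ar_select_order(train_data['PopEst'], maxlag=12)
print(modelp.ar_lags)  # lags of the selected model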

The code above returns a numpy array of [ 1 2 3 4 5 6 7 8 9 10 11 12], which I am interpreting as "The optimal lag p is 12" as per this StackOverflow question: stackoverflow. However, evaluating on some metrics (RMSE) I find that my AutoReg fitted models with maxlag=12 are performing worse than lower order models. By trial and error I found that the optimal lag is 8. So I am having difficulty interpreting the resulting numpy array. I have been reading the resources on statsmodels.com/ar_select_order and statsmodels.com/autoregressions, but they have not made it clearer.

Does anyone here have any input? I am new to this Python library and feeling a bit lost.

Solution

The code above returns a numpy array of [ 1 2 3 4 5 6 7 8 9 10 11 12], which I am interpreting as "The optimal lag p is 12" as per this StackOverflow question: stackoverflow.

Yes, that's right. The reason it returns an array instead of just 12 is that it can also search for models that do not include all of the lags, if you set glob=True. For example, [ 1 2 3 12] might be a common result for a monthly model that has some annual seasonal pattern.

However, evaluating on some metrics (RMSE) I find that my AutoReg fitted models with maxlag=12 are performing worse than lower order models. By trial and error I found that the optimal lag is 8. So I am having difficulty interpreting the resulting numpy array. I have been reading the resources on statsmodels.com/ar_select_order and statsmodels.com/autoregressions, but they have not made it clearer.

This function returns the model that is judged optimal by an information criterion. The default is BIC, the Bayesian information criterion. An information criterion is computed on the data the model was fitted to and penalizes extra parameters, so it is not the same thing as forecast accuracy on held-out data. If you use another criterion, such as minimizing the out-of-sample RMSE, it is entirely possible for a different model to be judged optimal.