I'm trying to test ARMA models, and working through the examples provided here:
http://www.statsmodels.org/dev/examples/notebooks/generated/tsa_arma_0.html
I can't tell if there is a straightforward way to train a model on a training dataset and then test it on a separate test dataset. It seems that you have to fit the model on the entire dataset. You can then make in-sample predictions, which use the same data the model was trained on, or out-of-sample predictions, which have to start at the end of the training data. What I would like to do instead is fit the model on a training dataset, then run it over an entirely different dataset that wasn't part of the training data and get a series of one-step-ahead predictions.
To illustrate the issue, here is abbreviated code from the link above. The model is fit on data for 1700-2008 and then predicts 1990-2012. The problem is that 1990-2008 were already part of the data used to fit the model, so I think I'm predicting on the same data I trained on. I want to get a series of one-step-ahead predictions that don't have look-ahead bias.
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
dta = sm.datasets.sunspots.load_pandas().data
dta.index = pandas.Index(sm.tsa.datetools.dates_from_range('1700', '2008'))
dta = dta.drop(columns='YEAR')
# Note: sm.tsa.ARMA requires statsmodels < 0.13; it was removed in later releases
arma_mod30 = sm.tsa.ARMA(dta, (3, 0)).fit(disp=False)
predict_sunspots = arma_mod30.predict('1990', '2012', dynamic=True)
fig, ax = plt.subplots(figsize=(12, 8))
ax = dta.loc['1950':].plot(ax=ax)
fig = arma_mod30.plot_predict('1990', '2012', dynamic=True, ax=ax, plot_insample=False)
plt.show()
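To make the goal concrete, here is a minimal sketch in plain Python (not statsmodels) of what I mean by one-step-ahead predictions without look-ahead bias. It assumes a toy AR(1) process with simulated data; the helper names `fit_ar1` and `one_step_ahead` are made up for illustration. The coefficient is estimated on the training slice only and is then held fixed while stepping through the test slice.

```python
import random


def fit_ar1(series):
    """Least-squares estimate of phi in y[t] = phi * y[t-1] + noise."""
    num = sum(cur * prev for prev, cur in zip(series, series[1:]))
    den = sum(prev * prev for prev in series[:-1])
    return num / den


def one_step_ahead(phi, series):
    """Predict series[t] from series[t-1] with phi held fixed (no refitting)."""
    return [phi * prev for prev in series[:-1]]


# Simulated AR(1) data; true_phi and the sample sizes are arbitrary choices
random.seed(0)
true_phi = 0.8
y = [0.0]
for _ in range(499):
    y.append(true_phi * y[-1] + random.gauss(0, 1))

train, test = y[:400], y[400:]

phi_hat = fit_ar1(train)              # parameters come from the training data only
preds = one_step_ahead(phi_hat, test) # preds[i] is the forecast for test[i + 1]
errors = [actual - p for actual, p in zip(test[1:], preds)]
```

Each prediction uses the actual previous observation from the test set, but the test set never influences `phi_hat`.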
In the 16 months since I asked this question, I've learned a lot more about ARIMA modeling in statsmodels. As far as I can tell, the behavior I'm looking for isn't supported by the ARMA or ARIMA models, but it is supported by the SARIMAX model. See the code below, based on the examples from statsmodels.org: it fits the model on 1700-1990 only, then builds a second model over the full dataset and applies the fitted parameters with filter() instead of refitting. The green line represents an ARIMA(10,0,0) model (that is, AR(10)) that was trained on 1700-1990 and then dynamically predicted from 1990-2012.
https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_sarimax_stata.html
import pandas
import matplotlib.pyplot as plt
import statsmodels.api as sm
dta = sm.datasets.sunspots.load_pandas().data
dta.index = pandas.Index(sm.tsa.datetools.dates_from_range('1700', '2008'))
dta = dta.drop(columns='YEAR')
arma_mod30 = sm.tsa.ARMA(dta, (3, 0)).fit(disp=False)
predict_sunspots = arma_mod30.predict('1990', '2012', dynamic=True)
fig, ax = plt.subplots(figsize=(12, 8))
ax = dta.loc['1950':].plot(ax=ax)
fig = arma_mod30.plot_predict('1990', '2012', dynamic=True, ax=ax, plot_insample=False)
# Fit the model
mod = sm.tsa.statespace.SARIMAX(dta.loc[:'1990'], order=(10, 0, 0))
fit_res = mod.fit(disp=False)
# Create new model, but instead of fit, copy the params from the first model
mod = sm.tsa.statespace.SARIMAX(dta, order=(10, 0, 0))
res = mod.filter(fit_res.params)
# Dynamic predictions
predict_dy = res.get_prediction(dynamic='1990', end='2012')
predict_dy = predict_dy.predicted_mean
predict_dy['1990':].plot(ax=ax)
plt.show()