Tags: statsmodels, arima, sarimax

Why does the number of observations change the prediction of a SARIMAX model with fixed coefficients?


After training a SARIMAX model, I had hoped to be able to perform forecasts with it in the future, feeding it new observations without having to retrain it. However, I noticed that the number of observations I use in the newly applied forecast changes the predictions.

From my understanding, provided that enough observations are given to allow the autoregressive and moving-average terms to be calculated, the model would not even use the earlier historical observations to inform itself, as the coefficients are not being retrained. In a (3,0,1) example, I would have thought it would need at least 3 observations to apply its trained coefficients. However, this does not seem to be the case, and I am questioning whether I have understood the model correctly.
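
For concreteness, the mental model I had in mind is roughly the textbook ARMA(3,1) one-step-ahead prediction,

yhat[t+1] = ar.L1*y[t] + ar.L2*y[t-1] + ar.L3*y[t-2] + ma.L1*e[t]

which only involves the three most recent observations and the latest residual.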

As an example and test, I have applied a trained SARIMAX model to the exact same data with the initial few observations removed, to test the effect of the number of rows on the prediction, using the following code:

import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = [348, 363, 435, 491, 505, 404, 359, 310, 337, 360, 342, 406, 396, 420, 472, 548, 559, 463, 407, 362, 405, 417, 391, 419, 461, 472, 535, 622, 606, 508, 461, 390, 432]
ynew = y[10:]  # the same series with the first 10 observations dropped
print(ynew)

# Fit an ARMA(3,1) model (order=(3, 0, 1): no differencing, no seasonal terms);
# note that 'model' is re-bound to the results object here
model = SARIMAX(endog=y, order=(3, 0, 1))
model = model.fit()
print(model.params)
pred1 = model.predict(start=len(y), end=len(y)+7)

# Apply the fitted coefficients to the shortened series without refitting
model2 = model.apply(ynew)
print(model.params)
pred2 = model.predict(start=len(ynew), end=len(ynew)+7)

print(pd.DataFrame({'pred1': pred1, 'pred2': pred2}))

The results are as follows:

   pred1       pred2
0  472.246996  472.711770
1  494.753955  495.745968
2  498.092585  499.427285
3  489.428531  490.862153
4  477.678527  479.035869
5  469.023243  470.239459
6  465.576002  466.673790
7  466.338141  467.378903

Based on this, if I were to produce a forecast from a trained model with new observations, the change in the number of observations would itself affect the integrity of the forecast.

What is the explanation for this? And what is the standard practice for applying a trained model to new observations, given that their number may change?

If I wanted to update the model but could not control whether I still had all of the original observations from the very start of my training set, this test would suggest that my forecast might as well be random numbers.


Solution

  • Main issue

    The main problem here is that you are not using your new results object (model2) for your second set of predictions. You have:

    pred2 = model.predict(start=len(ynew), end=len(ynew)+7)
    

    but you should have:

    pred2 = model2.predict(start=len(ynew), end=len(ynew)+7)
    

    If you fix this, you get very similar predictions:

          pred1       pred2
    0  472.246996  472.711770
    1  494.753955  495.745968
    2  498.092585  499.427285
    3  489.428531  490.862153
    4  477.678527  479.035869
    5  469.023243  470.239459
    6  465.576002  466.673790
    7  466.338141  467.378903
    

    To understand why they're not identical, there is a second issue (which is not a problem in your code, but just a statistical feature of your data/model).

  • Secondary issue

    Your estimated parameters imply an extremely persistent model:

    print(model.params)
    

    gives

    ar.L1        2.134401
    ar.L2       -1.683946
    ar.L3        0.549369
    ma.L1       -0.874801
    sigma2    1807.187815
    

    which is associated with a near-unit-root process (largest eigenvalue = 0.99957719).
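
    You can check this persistence directly from the printed coefficients. A quick sketch using numpy (the coefficient values are simply copied from the output above):

    import numpy as np

    # AR coefficients copied from the fitted parameters
    ar1, ar2, ar3 = 2.134401, -1.683946, 0.549369

    # The eigenvalues are the roots of z**3 - ar1*z**2 - ar2*z - ar3;
    # a maximum modulus near 1 means shocks decay very slowly
    roots = np.roots([1.0, -ar1, -ar2, -ar3])
    print(np.abs(roots).max())  # ~0.9996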

    What this means is that it takes a very long time for the effect of any particular data point on the forecast to die out: the filtered state (and hence the implied moving-average residual) is built recursively from all past observations, and with a root this close to one, each observation's weight decays very slowly. In your case, this just means that there are still small effects on the forecasts from the first 10 periods.

    This isn't a problem; it's just the way this particular estimated model works.
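
    Finally, on the standard-practice part of the question: if the new observations simply continue the training sample, the results object's append and extend methods reuse the fitted coefficients without re-estimating and without restarting the filter partway through the history. A sketch, assuming a recent statsmodels version (the values in ynext are invented for illustration):

    # Hypothetical new observations that follow the original series
    ynext = [478, 455, 429, 461, 470]

    # append() re-runs the filter over the original sample plus the new
    # observations, keeping the estimated parameters (refit=False is the
    # default), so forecasts are based on the full history
    model3 = model.append(ynext)
    pred3 = model3.predict(start=len(y)+len(ynext), end=len(y)+len(ynext)+7)

    # extend() takes only the new observations but initializes the filter
    # from the state at the end of the original sample, so no history is lost
    model4 = model.extend(ynext)
    pred4 = model4.predict(start=len(ynext), end=len(ynext)+7)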