python statsmodels arima autoregressive-models stochastic-process

SARIMAX simulation of possible paths

I am trying to create a simulation of possible paths of a stochastic process, which is not anchored to any particular point. E.g. fit SARIMAX model to weather temperature data and then use the model to make a simulation of the temperature.

Here I use the standard demonstration from statsmodels page as a simpler example:

import numpy as np
import pandas as pd
from scipy.stats import norm
import statsmodels.api as sm
import matplotlib.pyplot as plt
from datetime import datetime
import requests
from io import BytesIO

Fitting the model:

wpi1 = requests.get('https://www.stata-press.com/data/r12/wpi1.dta').content
data = pd.read_stata(BytesIO(wpi1))
data.index = data.t
# Set the frequency
data.index.freq="QS-OCT"

# Fit the model
mod = sm.tsa.statespace.SARIMAX(data['wpi'], trend='c', order=(1,1,1))
res = mod.fit(disp=False)
print(res.summary())

Creating simulation:

res.simulate(len(data),  repetitions=10).plot();

Here is the history:

Here is the simulation:

The simulated curves are so widely distibuted and apart from each other that this cannot make sense. The initial historical process doesn't have that much of a variance. What do I understand wrongly? How to perform the right simulation?

Solution

When you don't pass an initial state, it uses the first predicted state to start the simulation along with its predicted covariance. Since there is no information available to make the first prediction, it uses a diffuse prior with a variance of 1,000,000. This is why you are getting the wide range in your time series. A simple solution is to pass your own initial state using the smoothed_state.

Taking your code above, but using

initial = res.smoothed_state[:, 0]
res.simulate(len(data),
             repetitions=10,
             initial_state=initial).plot()

I get a plot that looks like

The first value is what really matters in this model, and is 30.6. You could add some randomness here directly by drawing the initial state from another (sensible) distribution. The default distribution is not sensible for simulation since it has a diffuse prior (it is, however, very sensible for estimation).

Other Notes

One other small note: You should not use trend="c" with d=1. You should instead use trend="t" when d=1 so that the model includes a drift. The model you estimate should be

mod = sm.tsa.statespace.SARIMAX(data["wpi"], trend="t", order=(1, 1, 1))

I used this model in the picture above to capture the positive trend in the data.