python time-series statsmodels sarimax

SARIMAX out of sample forecast with exogenous data


I am working on a time series analysis with SARIMAX and have been really struggling with it.

I think I have successfully fit a model and used it to make predictions; however, I don't know how to make an out-of-sample forecast with exogenous data.

I may be doing the whole thing wrong, so I have included my steps below with some sample data:

import pandas as pd
import numpy as np
import statsmodels.api as sm

# Defining Sample data
df = pd.DataFrame({'date':['2019-01-01','2019-01-02','2019-01-03',
                         '2019-01-04','2019-01-05','2019-01-06',
                         '2019-01-07','2019-01-08','2019-01-09',
                         '2019-01-10','2019-01-11','2019-01-12'],
                  'price':[78,60,62,64,66,68,70,72,74,76,78,80],
                 'factor1':[178,287,152,294,155,245,168,276,165,275,178,221]
                })
# Parsing the dates and using them as a sorted DatetimeIndex
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.set_index('date')
df.sort_index(inplace=True)
df.dropna(inplace=True)

# Splitting Data into test and training sets manually
train = df.loc['2019-01-01':'2019-01-09']
test = df.loc['2019-01-10':'2019-01-12']

# Converting the index of the train and test sets to a daily PeriodIndex
train.index = pd.DatetimeIndex(train.index).to_period('D')
test.index = pd.DatetimeIndex(test.index).to_period('D')

# Defining and fitting the model with training data for endogenous and exogenous data

model=sm.tsa.statespace.SARIMAX(train['price'],
                                order=(0, 0, 0),
                                seasonal_order=(0, 0, 0,12), 
                                exog=train.iloc[:,1:],
                                time_varying_regression=True,
                                mle_regression=False)
model_1= model.fit(disp=False)

# Defining exogenous data for testing 
exog_test=test.iloc[:,1:]

# Forecasting out of sample data with exogenous data
forecast = model_1.forecast(3, exog=exog_test)

So my problem is really with the last line: what do I do if I want more than 3 steps?


Solution

  • I'll attempt an answer, since this mainly comes down to the shape of the data and what the statsmodels documentation says.

    As per the documentation, 'steps' is an integer: the number of steps to forecast from the end of the sample. The model needs exogenous data for BOTH the training period (to fit) and the forecast period, so if you want more than three steps you have to supply an exogenous array with one row for every step you forecast. (https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html) (https://www.statsmodels.org/dev/generated/statsmodels.tsa.statespace.sarimax.SARIMAXResults.forecast.html)

    Here are the two errors I get when I increase the number of steps by one without adding a matching exogenous row:

    ValueError: cannot reshape array of size 3 into shape (4,1) Provided exogenous values are not of the appropriate shape. Required (4, 1), got (3, 1).

    ValueError: the number of rows in the exogenous variable does not match the number of time periods you're asking it to predict

    With that said, simply expanding the testing set works and gets you the additional forecasts. Here is the working code, along with a link to the notebook:

    https://colab.research.google.com/drive/1o9KXAe61EKH6bDI-FJO3qXzlWjz9IHHw?usp=sharing

    import pandas as pd
    import numpy as np
    # train_test_split from sklearn is not imported because the
    # train/test split is done manually below
    
    # Defining Sample data
    df=pd.DataFrame({'date':['2019-01-01','2019-01-02','2019-01-03',
                             '2019-01-04','2019-01-05','2019-01-06',
                             '2019-01-07','2019-01-08','2019-01-09',
                             '2019-01-10','2019-01-11','2019-01-12'],
                      'price':[78,60,62,64,66,68,70,72,74,76,78,80],
                     'factor1':[178,287,152,294,155,245,168,276,165,275,178,221]
                    })
    # Parsing the dates and using them as a sorted DatetimeIndex
    df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
    df = df.set_index('date')
    df.sort_index(inplace=True)
    df.dropna(inplace=True)
    
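    # Splitting data into training and test sets manually.
    # CHANGED: the test set now starts at 2019-01-09 instead of 2019-01-10,
    # so it has one more day and the exogenous array is (4, 1)
    # (it would be (4, 2) with a second exogenous column).
    # The training set ends one day earlier so the two sets do not overlap
    # and each exogenous row lines up with the day being forecast.
    # forecast() expects one exogenous row per step, so steps is 4 here.
    train=df.loc['2019-01-01':'2019-01-08']
    test=df.loc['2019-01-09':'2019-01-12']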
    
    # Converting the index of the train and test sets to a daily PeriodIndex
    train.index = pd.DatetimeIndex(train.index).to_period('D')
    test.index = pd.DatetimeIndex(test.index).to_period('D')
    
    # Defining and fitting the model with training data for endogenous and exogenous data
    import statsmodels.api as sm
    
    model=sm.tsa.statespace.SARIMAX(train['price'],
                                    order=(0, 0, 0),
                                    seasonal_order=(0, 0, 0,12), 
                                    exog=train.iloc[:,1:],
                                    time_varying_regression=True,
                                    mle_regression=False)
    model_1= model.fit(disp=False)
    
    # Defining exogenous data for testing 
    exog_test=test.iloc[:,1:]
    
    # Forecasting out of sample data with exogenous data
    forecast = model_1.forecast(4, exog=exog_test)