I'm trying to understand something about autoregressive time series forecasting models. This is probably going to seem like a basic question to anyone who knows what they are doing. I'm trying to dig into the theory in order to learn how it works, instead of just implementing a forecasting model without knowing what it's doing.
Here are the libraries that I'm using:
library(tidyverse)
library(tsibble)
Here is a toy dataframe I set up.
date <- c("2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04", "2022-01-05", "2022-01-06", "2022-01-07", "2022-01-08", "2022-01-09", "2022-01-10", "2022-01-11", "2022-01-12", "2022-01-13", "2022-01-14")
sales <- c(10, 40, 30, 20, 50, 20, 10, 20, 10, 30, 60, 10, 10, 50)
forecast_table <- data.frame(date, sales)
forecast_table$date <- as.Date(forecast_table$date)
forecast_table <- as_tsibble(forecast_table)
I can graph it using the autoplot function. It looks pretty random to me.
autoplot(forecast_table, sales)
I can now try a simple autoregressive model with one term. It's my understanding I leave the d
and q
terms alone in order to implement an AR model.
forecast_table %>%
model(ARIMA(sales ~ pdq(1,0,0))) %>%
forecast(h = 1)
The forecast I get for the next step ahead is 20.77587.
Now I'm trying to see what's happening under the hood of that prediction so I'm building up the regression myself. Here is where I must be doing something wrong.
I first create the lag of the value.
forecast_table <- forecast_table %>%
mutate(lag1_sales = lag(sales, 1))
I then create a model with the lags as the x
predictor.
ar1_model <- lm(sales ~ lag1_sales, forecast_table)
I get these terms:
Coefficients:
(Intercept) lag1_sales
33.3333 -0.2292
But when I plug in the most recent lagged value (10) into the equation:
33.3333 + (-0.2292 * 10)
I get a value of 31.0413, not 20.77587. What gives?
Most recent lagged value is 50 not 10, as you are forecasting the "next" value. Also, ARIMA uses a combination of maximum likelihood (ML) and conditional sum of squares (CSS). So, specifying "CSS" as fitting method:
result <- forecast_table %>%
model(ARIMA(sales ~ pdq(1,0,0), method = "CSS") ) %>%
forecast(h = 1)
result
# .model date sales .mean
# <chr> <date> <dist> <dbl>
# "ARIMA(sales ~ pdq(1, 0, 0), met… 2022-01-15 N(22, 288) 21.9
result$.mean
# [1] 21.866
and:
33.3333 + (-0.2292 * 50)
# 21.875
which is almost the same.