I am confused about the difference in the fitting results of the Arima()
function and glm()
function.
I want to fit an AR(1) model with an exogeneous variable. Here is the equation:
$$
x_{t} = \alpha_{0} + \alpha_{1}x_{t-1} + \beta_{1}z_{t} + \epsilon_{t}
$$
Now I estimate this model using the Arima()
function and glm()
function and compare the results, but the results turned out to be quite different!
Here is the sample data. x denotes the time-series variable, and z denotes the exogeneous variable, as shown in the equation above.
library(forecast)
library(tidyverse)
data("Nile")
df <-
Nile %>%
as_tibble() %>%
mutate(x = as.numeric(x)) %>%
mutate(z = rnorm(100))
Then fit the model using the Arima()
and glm()
and compare the results.
fit_arima <- Arima(df$x, order = c(1, 0, 0), include.mean = TRUE, xreg = df$z)
tibble(Parameters = c("x lag", "intercept", "z"),
Coefficients = coef(fit_arima),
Standard_Errors = sqrt(diag(vcov(fit_arima))))
fit_glm <- glm(df$x ~ lag(df$x) + df$z)
tibble(Parameters = c("intercept", "x lag", "z"),
Coefficients = coef(fit_glm),
Standard_Errors = summary(fit_glm)$coefficients[, "Std. Error"])
The results are displayed as follows.
Arima()
function:
# A tibble: 3 × 3
Parameters Coefficients Standard_Errors
<chr> <dbl> <dbl>
1 x lag 0.510 0.0868
2 intercept 920. 29.4
3 z 5.02 12.1
glm()
function:
# A tibble: 3 × 3
Parameters Coefficients Standard_Errors
<chr> <dbl> <dbl>
1 intercept 444. 83.4
2 x lag 0.516 0.0896
3 z 8.95 13.9
The estimated coefficient and standard error of x lag are quite close, but the values of other two variables are very different. I find this puzzling because both the Arima()
and glm()
function use the maximum likelihood estimator. Could you please explain why this difference happens and how can I fix this?
First, Arima()
does not fit the model given in your equation. It fits a regression with ARIMA errors like this:
x_{t} = \alpha_{0} + \beta_{1}z_{t} + \eta_{t}
where
\eta_t = \phi_{1}\eta_{t-1}+\varepsilon_{t}.
We can rearrange this to give
x_{t} = (1-\phi_{1})\alpha_{0} + \phi_{1}x_{t-1} + \beta_{1}z_{t} - \beta_{1}\phi_{1}z_{t-1} + \varepsilon_{t}
This explains the major differences in the two results.
But even if you specified exactly the same model, they would give slightly different results because Arima()
uses the true likelihood whereas glm()
will use a conditional likelihood because of the initial missing value due to the lag()
function.
See https://robjhyndman.com/hyndsight/arimax/ for a discussion of the different model specifications.