I've recently encountered a problem with predict()
function for plm
regression that uses AR(1) term in form of lag()
.
Here is an toy example: 3 countries, 3 periods, current population as a function a lagged population, the models are estimated with pooled OLS.
library(plm)
data <- data.frame(country = rep(c("A", "B", "C"), each = 3),
year = rep(c(1, 2, 3), 3),
population = c(100, 150, 200, 1000, 1200, 1250, 10, 20, 45)) |>
pdata.frame(index = c('country', 'year'))
data$lag_population <- plm::lag(data$population, k = 1)
lm_model <- lm(population ~ lag_population, data)
plm_model <- plm(population ~ plm::lag(population, k = 1), data, model = 'pooling')
plm_model2 <- plm(population ~ lag_population, data, model = 'pooling')
All three models are identical:
> print(lm_model)
Call:
lm(formula = population ~ lag_population, data = data)
Coefficients:
(Intercept) lag_population
31.633 1.079
+ print(plm_model)
Model Formula: population ~ plm::lag(population, k = 1)
Coefficients:
(Intercept) plm::lag(population, k = 1)
31.6332 1.0787
+ print(plm_model2)
Model Formula: population ~ lag_population
Coefficients:
(Intercept) lag_population
31.6332 1.0787
Yet, their forecasts for the same data are different:
lm()
based forecast is exactly what is expected, with NA
in first periods for each cross-section (b/c first periods don't have lagged observations)plm()
forecast that uses generated variable lag_population
instead of the lag()
function, is similar to that from lm()
, yet, NA
s are droppedplm()
forecast that uses lag()
function is the most problematic: values are shifted 1 period back, plus some values are generated for last periods (the model is pooled, without fixed effects, so how are the forecasts generated for 1st periods, especially that fill.na = FALSE
is used?)predict_plm <- predict(plm_model, newdata = data, na.fill = FALSE)
predict_plm2 <- predict(plm_model2, newdata = data, na.fill = FALSE)
predict_lm <- predict(lm_model, newdata = data)
> print(predict_lm)
A-1 A-2 A-3 B-1 B-2 B-3 C-1 C-2 C-3
NA 139.50424 193.43973 NA 1110.34313 1326.08511 NA 42.42035 53.20745
+ print(predict_plm2)
A-2 A-3 B-2 B-3 C-2 C-3
139.50424 193.43973 1110.34313 1326.08511 42.42035 53.20745
+ print(predict_plm)
A-1 A-2 A-3 B-1 B-2 B-3 C-1 C-2 C-3
139.50424 193.43973 247.37522 1110.34313 1326.08511 1380.02060 42.42035 53.20745 80.17519
Has anyone encountered a similar issue?
How can I make the plm predictions look the same as those based on lm()
?
This is fixed already in the development version of plm
, see github.com/ycroissant/plmhttps://github.com/ycroissant/plm/
Install the development version by, e.g.:
# install.packages("remotes") # remove '#' if pkg 'remotes' is not installed
remotes::install_github("ycroissant/plm")
For your example, it yields:
predict(plm_model, newdata = data)
## A-2 A-3 B-2 B-3 C-2 C-3
## 139.50424 193.43973 1110.34313 1326.08511 42.42035 53.20745
Regards the case with NA
s: currently, NA
-padding (via na.action
) is not supported. So, plm's predict
corresponds to lm's predict with these arguments:
predict(lm_model, newdata = data, na.action = na.omit)
## A-2 A-3 B-2 B-3 C-2 C-3
## 139.50424 193.43973 1110.34313 1326.08511 42.42035 53.20745