This is a question I opened as an issue but haven't heard from the package author, so I thought I would ask the question here. Thanks!
I am noticing some inconsistencies when forecasting with lagged xregs. Specifically, forecasts for h <= lag period. It seems like the historical data provided to the original model is not added to the new data before generating the forecast. In the example below I use the lag = 2 example from fpp3. The first forecast fc1
is identical to the one generated in the book. In the second forecast fc2
I augment the new_data
by binding the historical advert data with the new advert data generated in insurance_future
. When I do this I get a different forecast in fc2
vs fc1
. It seems to me like the forecast in fc1
does not have access to the historical (xreg) data, so the TVaderts is treated as NA
for the first two steps in the horizon. Is this correct? And if so, shouldn't that data be included as it is in fc2
? This might be related to.
library(fpp3)
#> ── Attaching packages ──────────────────────────────────────────── fpp3 0.4.0 ──
#> ✓ tibble 3.1.2 ✓ tsibble 1.0.1
#> ✓ dplyr 1.0.6 ✓ tsibbledata 0.3.0
#> ✓ tidyr 1.1.3 ✓ feasts 0.2.1
#> ✓ lubridate 1.7.10 ✓ fable 0.3.1
#> ✓ ggplot2 3.3.3
#> ── Conflicts ───────────────────────────────────────────────── fpp3_conflicts ──
#> x lubridate::date() masks base::date()
#> x dplyr::filter() masks stats::filter()
#> x tsibble::intersect() masks base::intersect()
#> x tsibble::interval() masks lubridate::interval()
#> x dplyr::lag() masks stats::lag()
#> x tsibble::setdiff() masks base::setdiff()
#> x tsibble::union() masks base::union()
library(fabletools)
library(fable)
library(dplyr)
library(tsibble)
fit <- insurance %>%
# Restrict data so models use same fitting period
# Estimate models
model(
lag2 = ARIMA(Quotes ~ pdq(d = 0) +
TVadverts + lag(TVadverts) +
lag(TVadverts, 2))
)
insurance_future <- new_data(insurance, 20) %>%
mutate(TVadverts = 8)
# Forecast as shown in https://otexts.com/fpp3/lagged-predictors.html
fc1 <- fit %>%
forecast(insurance_future)
# Manually pre-pend historic advert data to future data to ensure presence of
# lagged regressors
fc2 <- fit %>%
forecast(bind_rows(select(insurance, -Quotes), insurance_future)) %>%
filter_index(as.character(min(insurance_future$Month)) ~ .)
print(fc1)
#> # A fable: 20 x 5 [1M]
#> # Key: .model [1]
#> .model Month Quotes .mean TVadverts
#> <chr> <mth> <dist> <dbl> <dbl>
#> 1 lag2 2005 May N(13, 0.23) 13.0 8
#> 2 lag2 2005 Jun N(13, 0.59) 13.0 8
#> 3 lag2 2005 Jul N(13, 0.72) 13.2 8
#> 4 lag2 2005 Aug N(13, 0.72) 13.2 8
#> 5 lag2 2005 Sep N(13, 0.72) 13.2 8
#> 6 lag2 2005 Oct N(13, 0.72) 13.2 8
#> 7 lag2 2005 Nov N(13, 0.72) 13.2 8
#> 8 lag2 2005 Dec N(13, 0.72) 13.2 8
#> 9 lag2 2006 Jan N(13, 0.72) 13.2 8
#> 10 lag2 2006 Feb N(13, 0.72) 13.2 8
#> 11 lag2 2006 Mar N(13, 0.72) 13.2 8
#> 12 lag2 2006 Apr N(13, 0.72) 13.2 8
#> 13 lag2 2006 May N(13, 0.72) 13.2 8
#> 14 lag2 2006 Jun N(13, 0.72) 13.2 8
#> 15 lag2 2006 Jul N(13, 0.72) 13.2 8
#> 16 lag2 2006 Aug N(13, 0.72) 13.2 8
#> 17 lag2 2006 Sep N(13, 0.72) 13.2 8
#> 18 lag2 2006 Oct N(13, 0.72) 13.2 8
#> 19 lag2 2006 Nov N(13, 0.72) 13.2 8
#> 20 lag2 2006 Dec N(13, 0.72) 13.2 8
print(fc2)
#> # A fable: 20 x 5 [1M]
#> # Key: .model [1]
#> .model Month Quotes .mean TVadverts
#> <chr> <mth> <dist> <dbl> <dbl>
#> 1 lag2 2005 May N(14, 0.72) 13.5 8
#> 2 lag2 2005 Jun N(13, 0.72) 13.3 8
#> 3 lag2 2005 Jul N(13, 0.72) 13.2 8
#> 4 lag2 2005 Aug N(13, 0.72) 13.2 8
#> 5 lag2 2005 Sep N(13, 0.72) 13.2 8
#> 6 lag2 2005 Oct N(13, 0.72) 13.2 8
#> 7 lag2 2005 Nov N(13, 0.72) 13.2 8
#> 8 lag2 2005 Dec N(13, 0.72) 13.2 8
#> 9 lag2 2006 Jan N(13, 0.72) 13.2 8
#> 10 lag2 2006 Feb N(13, 0.72) 13.2 8
#> 11 lag2 2006 Mar N(13, 0.72) 13.2 8
#> 12 lag2 2006 Apr N(13, 0.72) 13.2 8
#> 13 lag2 2006 May N(13, 0.72) 13.2 8
#> 14 lag2 2006 Jun N(13, 0.72) 13.2 8
#> 15 lag2 2006 Jul N(13, 0.72) 13.2 8
#> 16 lag2 2006 Aug N(13, 0.72) 13.2 8
#> 17 lag2 2006 Sep N(13, 0.72) 13.2 8
#> 18 lag2 2006 Oct N(13, 0.72) 13.2 8
#> 19 lag2 2006 Nov N(13, 0.72) 13.2 8
#> 20 lag2 2006 Dec N(13, 0.72) 13.2 8
waldo::compare(fc1, fc2)
#> `old$Quotes[[1]]$mu`: 13.0
#> `new$Quotes[[1]]$mu`: 13.5
#>
#> `old$Quotes[[1]]$sigma`: 0.5
#> `new$Quotes[[1]]$sigma`: 0.8
#>
#> `old$Quotes[[2]]$mu`: 13.0
#> `new$Quotes[[2]]$mu`: 13.3
#>
#> `old$Quotes[[2]]$sigma`: 0.77
#> `new$Quotes[[2]]$sigma`: 0.85
#>
#> `old$.mean[1:5]`: 13.0 13.0 13.2 13.2 13.2
#> `new$.mean[1:5]`: 13.5 13.3 13.2 13.2 13.2
Curiously, when I create new lagged variables manually (rather than in the formula) the model results match the "base case" from fpp3 (fc1
in my example).
insurance_manlag <- insurance %>%
mutate(TVadverts1 = lag(TVadverts),
TVadverts2 = lag(TVadverts, 2))
fit <- insurance_manlag %>%
# Restrict data so models use same fitting period
# Estimate models
model(
lag2 = ARIMA(Quotes ~ pdq(d = 0) +
TVadverts + TVadverts1 + TVadverts2)
)
insurance_man_future <- append_row(insurance, n = 20) %>%
replace_na(replace = list(TVadverts = 8)) %>%
mutate(TVadverts1 = lag(TVadverts),
TVadverts2 = lag(TVadverts, 2)) %>%
slice_tail(n = 20)
# Forecast as shown in https://otexts.com/fpp3/lagged-predictors.html
fc3 <- fit %>%
forecast(insurance_man_future)
waldo::compare(fc1$Quotes, fc3$Quotes)
#> ✓ No differences
waldo::compare(fc2$Quotes, fc3$Quotes)
#> `old[[1]]$mu`: 13.5
#> `new[[1]]$mu`: 13.0
#>
#> `old[[1]]$sigma`: 0.8
#> `new[[1]]$sigma`: 0.5
#>
#> `old[[2]]$mu`: 13.3
#> `new[[2]]$mu`: 13.0
#>
#> `old[[2]]$sigma`: 0.85
#> `new[[2]]$sigma`: 0.77
Created on 2021-06-02 by the reprex package (v2.0.0)
This reproduction leads me to believe that fc1
is correct, rather than fc2
. If so, what is occurring in fc2
that causes it to have a different forecast vs that in fc1
(and fc3
)?
In {fable}
, models that produce forecasts retain all information needed to produce forecasts. When using the recommended interface to get fc1
(as shown in the book), the model is holding onto the 2 most recent values of TVadverts
. While they were not needed to estimate the model, they are required inputs to produce the first couple of forecasts.
When using the forecast()
function with new_data
, the intended behaviour is for the model to produce forecasts for each time point in new_data
. I believe forecasting from time points not at the end of the series is not yet implemented, so I will change this to produce an error.
Generally speaking, when using the lag()
function in a model formula there is no need to prepend the historical data. The models will store and recall the needed values for forecasting.