I'm struggling to understand the correct way to calculate 1-step forecast errors over a test set using the {fable}
package for R.
First, my understanding of a 1-step forecast error is that we: fit a model on the training data, forecast one step ahead, compare against the next observed value, then incorporate that observation (without re-estimating the model's parameters) and repeat across the test set.
I've had a look at this post: Does the forecast function within fable provide one-step forecasts? It aligns with the method discussed in FPP3, but I'm getting results that don't match my intuition, and the method fails when the test set is shorter than the seasonal period of the fitted models. Additionally, what do you do when a forecasting algorithm has no refit() method (such as the provided THETA() technique)?
Below is a code snippet with two ways that I believe should calculate 1-step forecast errors over a 24-month test set after a model has been fit. The first is based on the method in the post linked above; the second is a loop that I came up with.
Both methods produce accuracy measures and forecasts, but they differ by a not-insignificant amount. Why are they different, and which one is correct?
In general, if I have a trained model, make a prediction for t+1 and then observe y_{t+1}, it isn't clear to me how I should use the trained model together with the new observation to make a prediction for t+2.
# Init
library(tidyverse)
library(fable)
#> Loading required package: fabletools
# Prepare data
us_accidental_deaths <- as_tsibble(USAccDeaths)
deaths_train <- head(us_accidental_deaths, -24)
deaths_test <- tail(us_accidental_deaths, 24)
# Get models on training data
deaths_fits_0 <- deaths_train %>%
model(ets = ETS(value))
##############
# Normal way #
##############
# 'refit' without estimating new params/coefs on the new data
deaths_fits_1 <- deaths_fits_0 %>%
refit(deaths_test, reestimate = FALSE)
# what happens here if the test set is smaller than the seasonality windows?
# 1-step forecast accuracy on the test set?
deaths_fits_1 %>%
accuracy()
#> # A tibble: 1 x 10
#> .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ets Training 30.4 142. 111. 0.334 1.24 0.428 0.416 -0.105
# The 1-step forecasts?
fitted(deaths_fits_1)
#> # A tsibble: 24 x 3 [1M]
#> # Key: .model [1]
#> .model index .fitted
#> <chr> <mth> <dbl>
#> 1 ets 1977 Jan 7770.
#> 2 ets 1977 Feb 6906.
#> 3 ets 1977 Mar 7680.
#> 4 ets 1977 Apr 8135.
#> 5 ets 1977 May 8964.
#> 6 ets 1977 Jun 9264.
#> 7 ets 1977 Jul 10457.
#> 8 ets 1977 Aug 9547.
#> 9 ets 1977 Sep 8520.
#> 10 ets 1977 Oct 8660.
#> # ... with 14 more rows
################
# Using a loop #
################
# Initialise fit
deaths_fits_2 <- deaths_fits_0
# List to store the 1-step forecasts in
test_forecasts_list <- vector("list", NROW(deaths_test))
# Begin loop
for (i in seq_len(NROW(deaths_test))) {
  # 1-step forecast
  one_step_fc <- deaths_fits_2 %>%
    forecast(h = 1)
  # store
  test_forecasts_list[[i]] <- one_step_fc
  # refit using the newly observed data point
  deaths_fits_2 <- deaths_fits_0 %>%
    refit(
      bind_rows(
        deaths_train,
        deaths_test %>%
          arrange(index) %>%
          slice_head(n = i)
      ),
      reestimate = FALSE
    )
}
test_forecasts_2 <- bind_rows(test_forecasts_list)
# 1-step forecast accuracy over test set
test_forecasts_2 %>%
accuracy(data=us_accidental_deaths)
#> # A tibble: 1 x 10
#> .model .type ME RMSE MAE MPE MAPE MASE RMSSE ACF1
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ets Test 37.0 255. 194. 0.308 2.19 0.352 0.380 -0.305
# 1-step forecasts
test_forecasts_2
#> # A fable: 24 x 4 [1M]
#> # Key: .model [1]
#> .model index value .mean
#> <chr> <mth> <dist> <dbl>
#> 1 ets 1977 Jan N(7837, 110265) 7837.
#> 2 ets 1977 Feb N(7129, 95555) 7129.
#> 3 ets 1977 Mar N(7783, 93292) 7783.
#> 4 ets 1977 Apr N(7912, 91760) 7912.
#> 5 ets 1977 May N(8894, 89411) 8894.
#> 6 ets 1977 Jun N(9418, 87442) 9418.
#> 7 ets 1977 Jul N(10072, 85241) 10072.
#> 8 ets 1977 Aug N(9891, 89234) 9891.
#> 9 ets 1977 Sep N(8388, 93841) 8388.
#> 10 ets 1977 Oct N(8679, 91768) 8679.
#> # ... with 14 more rows
Created on 2022-05-31 by the reprex package (v2.0.1)
There are two groups of parameters in an ETS model: the initial states and the smoothing parameters. It rarely makes sense to take the original initial states and apply them to a different time period, so by default refit() re-estimates the initial states but keeps the smoothing parameters of the model provided.
In your first attempt, the initial states are re-estimated at the start of the test data, while in your loop they are re-estimated at every iteration.
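To see this concretely, here is a sketch reusing the `deaths_fits_0` and `deaths_fits_1` objects from the question. `tidy()` (from fabletools) extracts the estimated terms of each model in a mable; for ETS these include both the smoothing parameters and the initial states, so a join lines them up side by side:

```r
# A sketch, assuming the deaths_fits_0 / deaths_fits_1 objects defined above.
library(dplyr)

compare <- left_join(
  tidy(deaths_fits_0),   # estimated on the training data
  tidy(deaths_fits_1),   # refit(..., reestimate = FALSE) on the test data
  by = c(".model", "term"),
  suffix = c("_train", "_refit")
)
# The smoothing parameters (alpha, gamma, ...) are identical across the two
# columns; the initial states (l[0], s[0], ...) differ because they were
# re-estimated on the test window.
compare
```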
What you probably want is not to re-estimate anything, but to see how well your existing model does on the test set. You can do that by refitting the model to the training and test sets combined, then keeping only the fitted values for the test period. Like this:
# 'refit' without estimating new params/coefs on all data
deaths_fits_1 <- deaths_fits_0 %>%
refit(us_accidental_deaths, reinitialise = FALSE)
# 1-step forecasts on the test set
deaths_fits_1 %>%
fitted() %>%
tail(24)
That also avoids the problem of generating an error when the test set is smaller than the seasonal period.
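For completeness, here is a self-contained sketch of that approach, with a manual test-set RMSE and MAE computed from the 1-step fitted values (the `left_join`/`summarise` error computation at the end is my own addition for illustration, not part of fable's API):

```r
library(fable)
library(tsibble)
library(dplyr)

us_accidental_deaths <- as_tsibble(USAccDeaths)
deaths_train <- head(us_accidental_deaths, -24)

# Estimate the model on the training data only
fit <- deaths_train %>%
  model(ets = ETS(value))

# Refit to the full series, keeping both the smoothing parameters
# (reestimate = FALSE is the default) and the initial states
fit_all <- fit %>%
  refit(us_accidental_deaths, reinitialise = FALSE)

# The fitted values over the last 24 months are 1-step forecasts
one_step <- fitted(fit_all) %>% tail(24)

# Join against the observed test values and summarise the errors
us_accidental_deaths %>%
  tail(24) %>%
  left_join(one_step, by = "index") %>%
  mutate(error = value - .fitted) %>%
  as_tibble() %>%
  summarise(RMSE = sqrt(mean(error^2)), MAE = mean(abs(error)))
```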