[SOLVED] Does fabletools::skill_score respect transformations of the target variable?

While testing the accuracy of some models using fable, I found an interesting behavior with fabletools::skill_score. skill_score is described in the FPP3 book. If you calculate the test accuracy of a set of models that include a NAIVE/SNAIVE model with skill_score(CRPS) with no transformation of the target variable, the NAIVE/SNAIVE model has a skill_score of 0. This aligns with the description in the FPP3 book:

the proportion that the ... method improves over the naïve method based on CRPS

However, if you transform the target variable (e.g., log(x + 1)), the NAIVE/SNAIVE model no longer has a skill_score of 0. This suggests that skill_score does not carry the transformation of the target variable through to its benchmark. I looked at the source code and did not see any reference to transformations.

Is this the expected behavior of skill_score? If so, is there a way to carry the transformation over to skill_score? Or is skill_score not appropriate for models with transformed target variables?

This code replicates the expected behavior of skill_score on untransformed data:

library(fpp3)

google_stock <- gafa_stock |>
  filter(Symbol == "GOOG", year(Date) >= 2015) |>
  mutate(day = row_number()) |>
  update_tsibble(index = day, regular = TRUE)

google_stock |> 
  autoplot()

# Note: the test set is the last 80% of rows
test <- google_stock |> 
  slice_tail(prop = .8)

# The training set is the remaining first 20% of rows
train <- google_stock |> 
  anti_join(test)

fitted_model <- train |> 
  model(
    Mean = MEAN(Close),
    `Naïve` = NAIVE(Close),
    Drift = NAIVE(Close ~ drift())
  )

goog_fc <- fitted_model |> 
  forecast(h = 12)

fc_acc <- goog_fc |> 
  accuracy(google_stock,
           measures = list(point_accuracy_measures, distribution_accuracy_measures, crps_skill = skill_score(CRPS))) |> 
  select(.model, .type, CRPS, crps_skill, RMSSE)

fc_acc
# A tibble: 3 × 5
  .model .type  CRPS crps_skill RMSSE
  <chr>  <chr> <dbl>      <dbl> <dbl>
1 Drift  Test   38.2     0.0955  5.09
2 Mean   Test  109.     -1.59   12.6 
3 Naïve  Test   42.2     0       5.49

This replicates the unexpected behavior with the same data transformed with log(x + 1):

fitted_model_transformed <- train |> 
  model(
    Mean = MEAN(log(Close + 1)),
    `Naïve` = NAIVE(log(Close + 1)),
    Drift = NAIVE(log(Close + 1) ~ drift())
  )

goog_fc_transformed <- fitted_model_transformed |> 
  forecast(h = 12)

fc_acc_transformed <- goog_fc_transformed |> 
  accuracy(google_stock,
           measures = list(point_accuracy_measures, distribution_accuracy_measures, crps_skill = skill_score(CRPS))) |> 
  select(.model, .type, CRPS, crps_skill, RMSSE)

fc_acc_transformed
# A tibble: 3 × 5
  .model .type  CRPS crps_skill RMSSE
  <chr>  <chr> <dbl>      <dbl> <dbl>
1 Drift  Test   36.3     0.140   4.97
2 Mean   Test  110.     -1.61   12.6 
3 Naïve  Test   40.8     0.0353  5.42

I would expect the Naïve model crps_skill to be 0, because it cannot improve on itself.

> sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8  LC_CTYPE=English_United States.utf8    LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                           LC_TIME=English_United States.utf8    

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] fable_0.3.3       feasts_0.3.1      fabletools_0.3.4  tsibbledata_0.4.1 tsibble_1.1.3     ggplot2_3.4.3     lubridate_1.9.2  
 [8] tidyr_1.3.0       dplyr_1.1.3       tibble_3.2.1      fpp3_0.5         

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3       plotly_4.10.2        utf8_1.2.4           generics_0.1.3       anytime_0.3.9        digest_0.6.33       
 [7] magrittr_2.0.3       grid_4.3.1           timechange_0.2.0     pkgload_1.3.2.1      fastmap_1.1.1        jsonlite_1.8.7      
[13] modeldata_1.2.0      httr_1.4.7           purrr_1.0.2          fansi_1.0.5          viridisLite_0.4.2    scales_1.2.1        
[19] numDeriv_2016.8-1.1  textshaping_0.3.6    lazyeval_0.2.2       cli_3.6.1            rlang_1.1.1          crayon_1.5.2        
[25] ellipsis_0.3.2       munsell_0.5.0        withr_2.5.1          tools_4.3.1          colorspace_2.1-0     vctrs_0.6.4         
[31] R6_2.5.1             lifecycle_1.0.3      htmlwidgets_1.6.2    ragg_1.2.5           pkgconfig_2.0.3      progressr_0.14.0    
[37] pillar_1.9.0         gtable_0.3.4         rsconnect_1.1.0      data.table_1.14.8    glue_1.6.2           Rcpp_1.0.11         
[43] systemfonts_1.0.4    tidyselect_1.2.0     rstudioapi_0.15.0    farver_2.1.1         htmltools_0.5.6      labeling_0.4.3      
[49] compiler_4.3.1       distributional_0.3.2

Solution

A single model() call can mix several different transformations, so it would make no sense for skill_score() to benchmark against anything other than an untransformed model. Otherwise, the scores for different models could be measured against different benchmarks. Consequently, the benchmark Naïve method is always fitted to the untransformed variable. A Naïve model fitted to log(Close + 1) is therefore a different model from the benchmark, and its skill score need not be 0.
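As a rough check of this explanation, the sketch below recomputes the skill scores by hand from the CRPS values printed in the question. It assumes the usual definition, skill = 1 - score / benchmark score. The helper `skill()` and the vector names are hypothetical and not part of fabletools. Because the inputs are the rounded values from the printed tables, the results only match the reported `crps_skill` column approximately.

```r
# CRPS values copied from the two accuracy tables in the question.
# ("Naive" is spelled without the diaeresis here only to keep the names ASCII.)
crps_untransformed <- c(Drift = 38.2, Mean = 109, Naive = 42.2)
crps_transformed   <- c(Drift = 36.3, Mean = 110, Naive = 40.8)

# Hypothetical helper: proportional improvement of a score over a benchmark score.
skill <- function(score, benchmark) 1 - score / benchmark

# (a) Benchmark = Naive on the untransformed variable.
#     This should roughly reproduce the crps_skill column of the second table.
skill_reported <- skill(crps_transformed, crps_untransformed[["Naive"]])

# (b) Benchmark = Naive on the log(Close + 1) scale.
#     Here the transformed Naive model scores exactly 0 against itself.
skill_vs_transformed <- skill(crps_transformed, crps_transformed[["Naive"]])

print(round(skill_reported, 3))
print(round(skill_vs_transformed, 3))
```

If you want the transformed Naïve model as the reference, the same division can be applied to the CRPS column of fc_acc_transformed, instead of relying on skill_score().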