rdplyrforecastingfabletidyverts

Aggregating forecasts using Fable


Issue: Using fable I can easily produce forecasts on a time series with a grouped structure, and can even use Fable's aggregate_key/ reconcile syntax to produce a coherent top-level forecast. However I'm unable to easily access the aggregate forecasts using this method, and the alternative I'm using involves abandoning the fable (forecast table) structure. Can anyone tell me if there's an easier/intended way to do this using the package? As you can see in the examples, I'm able to get there using other methods, but I'd like to know if there's a better way. Any help gratefully received!

Approach 1: My efforts to summarise the forecast without using aggregate_key/ reconcile have been mainly using dplyr's group_by and summarise, however the prediction interval for the forecast is formatted as a normal distribution object, which doesn't seem to support summing using this method. To get around this I've been using hilo and unpack_hilo to extract bounds for different prediction intervals, which can then be summed using the usual method. However I'd really like to retain the fable structure and the distribution objects, which is impossible using this method.

Approach 2: The alternative, using aggregate_key/ reconcile only seems to support aggregation using min_trace. I understand that this method is for optimum reconciliation, whereas what I want is a simple bottom-up aggregate forecast. It feels like there should be an easy way to get bottom-up forecasts using this syntax, but I haven't found one so far. Moreover, even using min_trace I'm unsure how to access the aggregate forecast itself as you can see in the example!

Example using approach 1:

library(fable)
#> Loading required package: fabletools
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

lung_deaths_agg <- as_tsibble(cbind(mdeaths, fdeaths))
  
fc_1 <- lung_deaths_agg %>% 
  model(lm = TSLM(value ~ trend() + season())) %>% 
  forecast()

fc_1
#> # A fable: 48 x 5 [1M]
#> # Key:     key, .model [2]
#>    key     .model    index        value .mean
#>    <chr>   <chr>     <mth>       <dist> <dbl>
#>  1 fdeaths lm     1980 Jan N(794, 5940)  794.
#>  2 fdeaths lm     1980 Feb N(778, 5940)  778.
#>  3 fdeaths lm     1980 Mar N(737, 5940)  737.
#>  4 fdeaths lm     1980 Apr N(577, 5940)  577.
#>  5 fdeaths lm     1980 May N(456, 5940)  456.
#>  6 fdeaths lm     1980 Jun N(386, 5940)  386.
#>  7 fdeaths lm     1980 Jul N(379, 5940)  379.
#>  8 fdeaths lm     1980 Aug N(335, 5940)  335.
#>  9 fdeaths lm     1980 Sep N(340, 5940)  340.
#> 10 fdeaths lm     1980 Oct N(413, 5940)  413.
#> # ... with 38 more rows

fc_1 %>%
  hilo() %>% 
  unpack_hilo(c(`80%`, `95%`)) %>% 
  as_tibble() %>% 
  group_by(index) %>% 
  summarise(across(c(.mean, ends_with("upper"), ends_with("lower")), sum))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 24 x 6
#>       index .mean `80%_upper` `95%_upper` `80%_lower` `95%_lower`
#>       <mth> <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
#>  1 1980 Jan 2751.       3089.       3267.       2414.       2236.
#>  2 1980 Feb 2687.       3024.       3202.       2350.       2171.
#>  3 1980 Mar 2535.       2872.       3051.       2198.       2020.
#>  4 1980 Apr 2062.       2399.       2577.       1725.       1546.
#>  5 1980 May 1597.       1934.       2113.       1260.       1082.
#>  6 1980 Jun 1401.       1738.       1916.       1064.        885.
#>  7 1980 Jul 1343.       1680.       1858.       1006.        827.
#>  8 1980 Aug 1200.       1538.       1716.        863.        685.
#>  9 1980 Sep 1189.       1527.       1705.        852.        674.
#> 10 1980 Oct 1482.       1819.       1998.       1145.        967.
#> # ... with 14 more rows

Example using approach 2:

fc_2 <- lung_deaths_agg %>%
  aggregate_key(key, value = sum(value)) %>% 
  model(lm = TSLM(value ~ trend() + season())) %>%
  reconcile(lm = min_trace(lm)) %>% 
  forecast()

fc_2
#> # A fable: 72 x 5 [1M]
#> # Key:     key, .model [3]
#>    key     .model    index        value .mean
#>    <chr>   <chr>     <mth>       <dist> <dbl>
#>  1 fdeaths lm     1980 Jan N(794, 5606)  794.
#>  2 fdeaths lm     1980 Feb N(778, 5606)  778.
#>  3 fdeaths lm     1980 Mar N(737, 5606)  737.
#>  4 fdeaths lm     1980 Apr N(577, 5606)  577.
#>  5 fdeaths lm     1980 May N(456, 5606)  456.
#>  6 fdeaths lm     1980 Jun N(386, 5606)  386.
#>  7 fdeaths lm     1980 Jul N(379, 5606)  379.
#>  8 fdeaths lm     1980 Aug N(335, 5606)  335.
#>  9 fdeaths lm     1980 Sep N(340, 5606)  340.
#> 10 fdeaths lm     1980 Oct N(413, 5606)  413.
#> # ... with 62 more rows

fc_2 %>% as_tibble() %>% select(key) %>% slice(50:55)
#> # A tibble: 6 x 1
#>   key         
#>   <chr>       
#> 1 <aggregated>
#> 2 <aggregated>
#> 3 <aggregated>
#> 4 <aggregated>
#> 5 <aggregated>
#> 6 <aggregated>

fc_2 %>% as_tibble() %>% select(key) %>% filter(key == "<aggregated>")
#> # A tibble: 0 x 1
#> # ... with 1 variable: key <chr>

Solution

  • Approach 1:

    Working with distributions requires more care (than numbers) when adding things together. More specifically, the mean of a Normal distribution can be added without issue:

    library(distributional)
    mean(dist_normal(2,3) + dist_normal(4,1))
    #> [1] 6
    mean(dist_normal(2,3)) + mean(dist_normal(4,1))
    #> [1] 6
    

    Created on 2020-07-03 by the reprex package (v0.3.0)

    However the quantiles (used to produce your 80% and 95% intervals) cannot:

    library(distributional)
    quantile(dist_normal(2,3) + dist_normal(4,1), 0.9)
    #> [1] 10.05262
    quantile(dist_normal(2,3), 0.9) + quantile(dist_normal(4,1), 0.9)
    #> [1] 11.12621
    

    Created on 2020-07-03 by the reprex package (v0.3.0)

    If you want to aggregate distributions, you'll need to compute the sum on the distribution itself:

    library(fable)
    library(dplyr)
    lung_deaths_agg <- as_tsibble(cbind(mdeaths, fdeaths))
    
    fc_1 <- lung_deaths_agg %>% 
      model(lm = fable::TSLM(value ~ trend() + season())) %>% 
      forecast()
    fc_1 %>% 
      summarise(value = sum(value), .mean = mean(value))
    #> # A fable: 24 x 3 [1M]
    #>       index          value .mean
    #>       <mth>         <dist> <dbl>
    #>  1 1980 Jan N(2751, 40520) 2751.
    #>  2 1980 Feb N(2687, 40520) 2687.
    #>  3 1980 Mar N(2535, 40520) 2535.
    #>  4 1980 Apr N(2062, 40520) 2062.
    #>  5 1980 May N(1597, 40520) 1597.
    #>  6 1980 Jun N(1401, 40520) 1401.
    #>  7 1980 Jul N(1343, 40520) 1343.
    #>  8 1980 Aug N(1200, 40520) 1200.
    #>  9 1980 Sep N(1189, 40520) 1189.
    #> 10 1980 Oct N(1482, 40520) 1482.
    #> # … with 14 more rows
    

    Created on 2020-07-03 by the reprex package (v0.3.0)

    Note that this will require the development versions of fabletools (>=0.2.0.9000) and distributional (>=0.1.0.9000) as I have added new features to make this example work.

    Approach 2:

    Experimental support for bottom up reconciliation is available using fabletools:::bottom_up(). This is currently an internal function as I'm still working on some details of how reconciliation can be done more generally in fabletools.

    Matching aggregated values should be done with is_aggregated().

    fc_2 <- lung_deaths_agg %>%
      aggregate_key(key, value = sum(value)) %>% 
      model(lm = TSLM(value ~ trend() + season())) %>%
      reconcile(lm = min_trace(lm)) %>% 
      forecast()
    
    fc_2 %>% 
      filter(is_aggregated(key))
    #> # A fable: 24 x 5 [1M]
    #> # Key:     key, .model [1]
    #>    key          .model    index          value .mean
    #>    <chr>        <chr>     <mth>         <dist> <dbl>
    #>  1 <aggregated> lm     1980 Jan N(2751, 24989) 2751.
    #>  2 <aggregated> lm     1980 Feb N(2687, 24989) 2687.
    #>  3 <aggregated> lm     1980 Mar N(2535, 24989) 2535.
    #>  4 <aggregated> lm     1980 Apr N(2062, 24989) 2062.
    #>  5 <aggregated> lm     1980 May N(1597, 24989) 1597.
    #>  6 <aggregated> lm     1980 Jun N(1401, 24989) 1401.
    #>  7 <aggregated> lm     1980 Jul N(1343, 24989) 1343.
    #>  8 <aggregated> lm     1980 Aug N(1200, 24989) 1200.
    #>  9 <aggregated> lm     1980 Sep N(1189, 24989) 1189.
    #> 10 <aggregated> lm     1980 Oct N(1482, 24989) 1482.
    #> # … with 14 more rows
    

    Created on 2020-07-03 by the reprex package (v0.3.0)

    Comparing an aggregated vector with "<aggregated>" is ambiguous, as your key's character value may be "<aggregated>" without the value being <aggregated>. I've now updated fabletools to match "<aggregated>" with aggregated values with a warning and hint, so this code now gives:

    fc_2 %>% 
      filter(key == "<aggregated>")
    #> Warning: <aggregated> character values have been converted to aggregated values.
    #> Hint: If you're trying to compare aggregated values, use `is_aggregated()`.
    #> # A fable: 24 x 5 [1M]
    #> # Key:     key, .model [1]
    #>    key          .model    index          value .mean
    #>    <chr>        <chr>     <mth>         <dist> <dbl>
    #>  1 <aggregated> lm     1980 Jan N(2751, 24989) 2751.
    #>  2 <aggregated> lm     1980 Feb N(2687, 24989) 2687.
    #>  3 <aggregated> lm     1980 Mar N(2535, 24989) 2535.
    #>  4 <aggregated> lm     1980 Apr N(2062, 24989) 2062.
    #>  5 <aggregated> lm     1980 May N(1597, 24989) 1597.
    #>  6 <aggregated> lm     1980 Jun N(1401, 24989) 1401.
    #>  7 <aggregated> lm     1980 Jul N(1343, 24989) 1343.
    #>  8 <aggregated> lm     1980 Aug N(1200, 24989) 1200.
    #>  9 <aggregated> lm     1980 Sep N(1189, 24989) 1189.
    #> 10 <aggregated> lm     1980 Oct N(1482, 24989) 1482.
    #> # … with 14 more rows
    

    Created on 2020-07-03 by the reprex package (v0.3.0)