r time-series prediction hidden-markov-models

How to Make a One-Step-Ahead Prediction for Observed State Using hmmTMB in R?

I've been working with the hmmTMB package in R to model and predict oil price movements using a Hidden Markov Model (HMM). After fitting the model with training data, I'm struggling with making one-step-ahead predictions for the observed state (e.g., oil prices) using the fitted model on a test dataset.

# Load necessary libraries
library(hmmTMB)
library(tidyverse)

# Set seed for reproducibility
set.seed(123)

# Example dataset preparation (extended for 1 year)
df <- data.frame(
    day = seq.Date(from = as.Date("2020-01-01"), to = as.Date("2020-12-31"), by = "day"),
    OILPRICE = rnorm(366, 100, 10)
    ) %>%
    as_tibble()

# Splitting into training and test sets
split_index <- round(nrow(df) * 0.8)
training_set <- df[1:split_index, ] %>% as.data.frame()
testing_set <- df[(split_index + 1):nrow(df), ] %>% as.data.frame()

# Define a hidden Markov model with 2 states for the training set
hid1 <- MarkovChain$new(data = training_set, n_states = 2)
dists <- list(OILPRICE = "norm")
par0 <- list(OILPRICE = list(mean = c(110, 90), sd = c(1, 1)))
obs1 <- Observation$new(data = training_set, n_states = 2, dists = dists, par = par0)
par0 <- obs1$suggest_initial()
obs1 <- Observation$new(data = training_set, n_states = 2, dists = dists, par = par0)
hmm1 <- HMM$new(obs = obs1, hid = hid1)
hmm1$fit(silent = TRUE)

# Rename
data_plot <- training_set

# Color by most likely state sequence
data_plot$state <- factor(paste0("State ", hmm1$viterbi()))
ggplot(data_plot, aes(day, OILPRICE, col = state)) +
    geom_point() +
    scale_color_manual(values = pal, name = NULL)

# Attempting to make a one-step-ahead prediction
# ??? This is where I need guidance

enter image description here

How can I use hmmTMB to make a one-step-ahead prediction for the observed state (e.g., OILPRICE on the next day) using the fitted model (hmm1) on my test dataset (testing_set)? I am looking for a way to predict future OILPRICE values based on the model fitted to the training set.

I appreciate any insights or examples on how to achieve this with the hmmTMB package.

Solution

There are currently no built-in functions for forecasting in hmmTMB (as of February 2024). A forecasting distribution can however be computed from a fitted hmmTMB model.

The forecast distribution is a mixture of the state-dependent distributions, where the weights correspond to the probability of being in the different states. So, you can use the following workflow:

get state distribution for last observed data row (i.e., at time n); let's call this u
get state distribution at time n + h as u %*% (tpm %^% h), where tpm is the (one-step) transition probability matrix of the state process; that is, we multiply u by the h-step transition probability matrix
get forecast distribution as a mixture of the estimated state-dependent distributions, weighted by the state probabilities found in the previous step

In step 3, you could for example compute the forecast mean (as a weighted sum of the state-dependent means), or the probability density function of the forecast distribution as shown in the code below.

Forecast mean against time

The black dots show the mean of the forecast distribution for 60 days after the last observation of the training set.

Forecast distribution

The lines show the probability density functions of the forecast distribution for the 60 days following the last observation of the training set. As you can see, it's very likely to be in state 2 at first, so the corresponding state-dependent distribution has a higher weight. After a while, as the state distribution approaches the stationary (long-term) distribution, the forecast distribution stops changing.

Code

# Load packages
library(scico)
library(expm)

# Grid of oil prices to plot forecast distributions
grid <- seq(min(training_set$OILPRICE), 
            max(training_set$OILPRICE), 
            length = 100)

# Get state distribution at last (training) observation
sp <- hmm1$state_probs()
u <- sp[nrow(sp),]

# Get model parameters
tpm <- hid1$tpm()[,,1]
obspar <- obs1$par()[,,1]

# Loop over forecast times (over 60 days here)
h <- 1:60
mix <- list()
mix_mean <- rep(NA, length = length(h))
for(i in seq_along(h)) {
    # State distribution at time n + h
    dist_h <- u %*% (tpm %^% h[i])
    
    # Forecast distribution at time n + h
    mix[[i]] <- 
        dist_h[1] * dnorm(x = grid, 
                          mean = obspar["OILPRICE.mean", "state 1"], 
                          sd = obspar["OILPRICE.sd", "state 1"]) +
        dist_h[2] * dnorm(x = grid, 
                          mean = obspar["OILPRICE.mean", "state 2"], 
                          sd = obspar["OILPRICE.sd", "state 2"])
    
    # Mean of forecast distribution
    mix_mean[i] <- sum(dist_h * obspar["OILPRICE.mean",])
}

# Plot the forecast mean against time
df_mean <- data.frame(day = seq.Date(from = as.Date("2020-10-20"), 
                                     by = "day", length = length(h)),
                      mean = mix_mean)

ggplot(data_plot, aes(day, OILPRICE, col = state)) +
    geom_point() +
    geom_point(aes(y = mean, col = NULL), data = df_mean) +
    scale_color_manual(values = pal)

# Plot the forecast distributions for h = 1, 2, ..., 60
df_mix <- data.frame(h = rep(h, each = 100),
                 price = grid,
                 mix = unlist(mix))

ggplot(df_mix, aes(price, mix, col = h, group = h)) +
    geom_line() +
    scale_color_scico(name = "time") +
    labs(y = "forecast distribution")

Edit

Note that I used a slightly different simulated data set, generated from a 2-state hidden Markov model, to create the kind of patterns we would expect to find in real data. The data can be simulated as follows:

# Set seed for reproducibility
set.seed(123)

# Generate state process for simulated data
markov <- rep(1, 366)
for(i in 2:366) {
    switch <- sample(0:1, size = 1, prob = c(0.95, 0.05))
    if(switch) markov[i] <- 3 - markov[i-1]
    else markov[i] <- markov[i-1]
}

# Example dataset preparation (extended for 1 year)
df <- data.frame(
    day = seq.Date(from = as.Date("2020-01-01"), 
                   to = as.Date("2020-12-31"), 
                   by = "day"),
    
    # Generate observations conditionally on state process
    OILPRICE = rnorm(366, c(100, 140)[markov], 10)
)