rtime-seriespredictionhidden-markov-models

How to Make a One-Step-Ahead Prediction for Observed State Using hmmTMB in R?


I've been working with the hmmTMB package in R to model and predict oil price movements using a Hidden Markov Model (HMM). After fitting the model with training data, I'm struggling with making one-step-ahead predictions for the observed state (e.g., oil prices) using the fitted model on a test dataset.

# Load necessary libraries
library(hmmTMB)
library(tidyverse)

# Set seed for reproducibility
set.seed(123)

# Example dataset preparation (extended for 1 year)
df <- data.frame(
    day = seq.Date(from = as.Date("2020-01-01"), to = as.Date("2020-12-31"), by = "day"),
    OILPRICE = rnorm(366, 100, 10)
    ) %>%
    as_tibble()

# Splitting into training and test sets
split_index <- round(nrow(df) * 0.8)
training_set <- df[1:split_index, ] %>% as.data.frame()
testing_set <- df[(split_index + 1):nrow(df), ] %>% as.data.frame()

# Define a hidden Markov model with 2 states for the training set
hid1 <- MarkovChain$new(data = training_set, n_states = 2)
dists <- list(OILPRICE = "norm")
par0 <- list(OILPRICE = list(mean = c(110, 90), sd = c(1, 1)))
obs1 <- Observation$new(data = training_set, n_states = 2, dists = dists, par = par0)
par0 <- obs1$suggest_initial()
obs1 <- Observation$new(data = training_set, n_states = 2, dists = dists, par = par0)
hmm1 <- HMM$new(obs = obs1, hid = hid1)
hmm1$fit(silent = TRUE)

# Rename
data_plot <- training_set

# Color by most likely state sequence
data_plot$state <- factor(paste0("State ", hmm1$viterbi()))
ggplot(data_plot, aes(day, OILPRICE, col = state)) +
    geom_point() +
    scale_color_manual(values = pal, name = NULL)

# Attempting to make a one-step-ahead prediction
# ??? This is where I need guidance

enter image description here

How can I use hmmTMB to make a one-step-ahead prediction for the observed state (e.g., OILPRICE on the next day) using the fitted model (hmm1) on my test dataset (testing_set)? I am looking for a way to predict future OILPRICE values based on the model fitted to the training set.

I appreciate any insights or examples on how to achieve this with the hmmTMB package.


Solution

  • There are currently no built-in functions for forecasting in hmmTMB (as of February 2024). A forecasting distribution can however be computed from a fitted hmmTMB model.

    The forecast distribution is a mixture of the state-dependent distributions, where the weights correspond to the probability of being in the different states. So, you can use the following workflow:

    1. get state distribution for last observed data row (i.e., at time n); let's call this u
    2. get state distribution at time n + h as u %*% (tpm %^% h), where tpm is the (one-step) transition probability matrix of the state process; that is, we multiply u by the h-step transition probability matrix
    3. get forecast distribution as a mixture of the estimated state-dependent distributions, weighted by the state probabilities found in the previous step

    In step 3, you could for example compute the forecast mean (as a weighted sum of the state-dependent means), or the probability density function of the forecast distribution as shown in the code below.

    Forecast mean against time

    The black dots show the mean of the forecast distribution for 60 days after the last observation of the training set.

    Forecast mean against time

    Forecast distribution

    The lines show the probability density functions of the forecast distribution for the 60 days following the last observation of the training set. As you can see, it's very likely to be in state 2 at first, so the corresponding state-dependent distribution has a higher weight. After a while, as the state distribution approaches the stationary (long-term) distribution, the forecast distribution stops changing.

    Forecast distributions

    Code

    # Load packages
    library(scico)
    library(expm)
    
    # Grid of oil prices to plot forecast distributions
    grid <- seq(min(training_set$OILPRICE), 
                max(training_set$OILPRICE), 
                length = 100)
    
    # Get state distribution at last (training) observation
    sp <- hmm1$state_probs()
    u <- sp[nrow(sp),]
    
    # Get model parameters
    tpm <- hid1$tpm()[,,1]
    obspar <- obs1$par()[,,1]
    
    # Loop over forecast times (over 60 days here)
    h <- 1:60
    mix <- list()
    mix_mean <- rep(NA, length = length(h))
    for(i in seq_along(h)) {
        # State distribution at time n + h
        dist_h <- u %*% (tpm %^% h[i])
        
        # Forecast distribution at time n + h
        mix[[i]] <- 
            dist_h[1] * dnorm(x = grid, 
                              mean = obspar["OILPRICE.mean", "state 1"], 
                              sd = obspar["OILPRICE.sd", "state 1"]) +
            dist_h[2] * dnorm(x = grid, 
                              mean = obspar["OILPRICE.mean", "state 2"], 
                              sd = obspar["OILPRICE.sd", "state 2"])
        
        # Mean of forecast distribution
        mix_mean[i] <- sum(dist_h * obspar["OILPRICE.mean",])
    }
    
    # Plot the forecast mean against time
    df_mean <- data.frame(day = seq.Date(from = as.Date("2020-10-20"), 
                                         by = "day", length = length(h)),
                          mean = mix_mean)
    
    ggplot(data_plot, aes(day, OILPRICE, col = state)) +
        geom_point() +
        geom_point(aes(y = mean, col = NULL), data = df_mean) +
        scale_color_manual(values = pal)
    
    # Plot the forecast distributions for h = 1, 2, ..., 60
    df_mix <- data.frame(h = rep(h, each = 100),
                     price = grid,
                     mix = unlist(mix))
    
    ggplot(df_mix, aes(price, mix, col = h, group = h)) +
        geom_line() +
        scale_color_scico(name = "time") +
        labs(y = "forecast distribution")
    

    Edit

    Note that I used a slightly different simulated data set, generated from a 2-state hidden Markov model, to create the kind of patterns we would expect to find in real data. The data can be simulated as follows:

    # Set seed for reproducibility
    set.seed(123)
    
    # Generate state process for simulated data
    markov <- rep(1, 366)
    for(i in 2:366) {
        switch <- sample(0:1, size = 1, prob = c(0.95, 0.05))
        if(switch) markov[i] <- 3 - markov[i-1]
        else markov[i] <- markov[i-1]
    }
    
    # Example dataset preparation (extended for 1 year)
    df <- data.frame(
        day = seq.Date(from = as.Date("2020-01-01"), 
                       to = as.Date("2020-12-31"), 
                       by = "day"),
        
        # Generate observations conditionally on state process
        OILPRICE = rnorm(366, c(100, 140)[markov], 10)
    )