rtime-seriessimulationforecastingsurvival

How to generate multiple simulation paths in survival analysis forecasting?


I am trying to create R code for generating multiple simulation paths for forecasting survival probabilities. In the code posted at the bottom of this post, I take the survival package's lung dataset and create a new dataframe lung1, representing the lung data "as if" the study max period were 500 instead of the 1022 it actually is in the lung data. I use a parametric Weibull distribution per goodness-of-fit tests I ran separately. I'm trying to forecast, via multiple simulation paths, survival curves for periods 501-1000 using, ideally as a random number generation guide the Weibull parameters for the data for periods 1-500. This exercise is a forecasting "what-if", if I only had data for a 500-period lung study. I then compare the forecasts with the actual lung data for periods 501-1000.

The shape and scale parameters I extracted from the lung1 data are 1.804891 and 306.320693, respectively.

I'm having difficulty generating reasonable, monotonically decreasing simulation paths for forecast periods 501-1000. In looking at the code posted at the bottom, what should I be doing instead?

The below images help illustrate:

  1. First image is a K-M plot showing survival probabilities for the entire lung dataset.
  2. Second image plots lung1 (500 assumed study periods) with period 501-1000 forecasts extending in grey lines. Obviously something is not quite right!
  3. Third image is only there to show simulations I've done in the past before using time-series models such as ETS, which sort of gets at what I'm trying to do here with survival analysis. This isn't my best example, I've generated nice monotonically decreasing, concave forecast curves using log transformations and ETS. I am now trying to understand survival analysis better, no more ETS for now.

enter image description here

Code:

library(fitdistrplus)
library(dplyr)
library(survival)
library(MASS)

# Modify lung dataset as if study had only lasted 500 periods
lung1 <- lung %>% 
  mutate(time1 = ifelse(time >= 500, 500, time)) %>% 
  mutate(status1 = ifelse(status == 2 & time >= 500, 1, status))

fit1 <- survfit(Surv(time1, status1) ~ 1, data = lung1)

# Get survival probability values at each time point
surv_prob <- summary(fit1, times = seq(0, 500, by = 1))$surv

# Create a data frame with time and survival probability values
lungValues <- data.frame(Time = seq(0, 500, by = 1), Survival_Probability = surv_prob)

# Plot the survival curve using the new data frame
plot(lungValues$Time, lungValues$Survival_Probability, xlab = "Time", ylab = "Survival Probability",
     main = "Survival Plot", type = "l", col = "blue", xlim = c(0, 1000), ylim = c(0, 1))

# Generate correlation matrix for Weibull parameters
cor_matrix <- matrix(c(1.0, 0.5, 0.5, 1.0), nrow = 2, ncol = 2)

# Generate simulation paths for forecasting
num_simulations <- 10
forecast_period <- seq(501, 1000, by = 1)
start_prob <- 0.293692
shape <- 1.5
scale <- 100

for (i in 1:num_simulations) {
  # Generate random variables for the Weibull distribution
  random_vars <- mvrnorm(length(forecast_period), c(0, 0), Sigma = cor_matrix)
  shape_values <- exp(random_vars[,1])
  scale_values <- exp(random_vars[,2]) * scale
  
  # Calculate the survival probabilities for the forecast period
  surv_prob <- numeric(length(forecast_period))
  surv_prob[1] <- start_prob
  for (j in 2:length(forecast_period)) {
    # Calculate the survival probability using the Weibull distribution
    surv_prob[j] <- pweibull(forecast_period[j] - 500, shape = shape_values[j], scale = scale_values[j], lower.tail = FALSE)
    # Ensure the survival probability follows a monotonically decreasing, concave path
    if (surv_prob[j] > surv_prob[j-1]) {
      surv_prob[j] <- surv_prob[j-1] - runif(1, 0, 0.0005)
    }
  }
  
  # Combine the survival probabilities with the forecast period and create a data frame
  df <- data.frame(Time = forecast_period, Survival_Probability = surv_prob)
  
  # Add the simulation path to the plot
  lines(df$Time, df$Survival_Probability, type = "l", col = "grey")
}

Solution

  • I find the logic of your code very difficult to follow, so I'm not entirely sure if this is what you're looking for, but it seems reasonable to me.

    My starting point is your lung1 data frame.

    First, fit a Weibull ditsribution to your data and obtain the fitted estimates of shape and scale. Note that survreg and rweibull use different parameterisations.

    # Load the necessary packages
    library(fitdistrplus)
    library(dplyr)
    library(survival)
    library(MASS)
    library(tibble)
    library(ggplot2)
    
    # Modify lung dataset as if study had only lasted 500 periods
    lung1 <- lung %>%
      mutate(time1 = ifelse(time >= 500, 500, time)) %>%
      mutate(status1 = ifelse(status == 2 & time >= 500, 1, status))
    
    wFit <- survreg(Surv(time1,status1)~1, dist="w", data=lung1)
    
    # See online doc for survreg for explanation
    scale <- exp(wFit$coefficients[1])
    shape <- 1/wFit$scale
    median(rweibull(2000, shape, scale))
    [1] 326.3782
    

    [survfit reports a median of 310.]

    Now, write a function to impute survival times for survival beyond time 500 and then derive a Kaplan-Meier estimate of the survival distribution. The function takes a single parameter, x. If x is 1, Estimated survival probabilities for time < 500 are included in the return value, otherwise, they are not.

    # Generate simulation paths for forecasting
    num_simulations <- 10
    start_prob <- 0.293692
    
    simulate <- function(x=1) {
      lung2 <- lung1 %>% 
        mutate(
          time2 = ifelse(
                    time > 500, 
                    qweibull(runif(nrow(.), min=1-start_prob), shape, scale), 
                    time
                  ),
          # It's possible that some simulated survival times are *just* under 500
          status2 = ifelse(time == 500, 2, status)
        )
      summary2 <- summary(survfit(Surv(time2, status2) ~ 1, data = lung2))
      result <- tibble(Simulation=x, Time=summary2$time, Survival=summary2$surv) %>% 
        add_row(Simulation=x, Time=0, Survival=1, .before=1)
      if (x == 1) {
        return(result %>% mutate(Type=Time > 500))
      } 
      return(result %>% filter(Time > 500) %>% add_column(Type=TRUE))
    }
    

    Now use the function to simulate survival times > 500 several times. Bind the results into a single data frame.

    simulations <- lapply(1:num_simulations, simulate) %>% bind_rows()
    

    Plot the results.

    simulations %>% 
      ggplot(
        aes(x=Time, y=Survival, group=interaction(Simulation, Type), colour=Type)
      ) +
      geom_step(show.legend=FALSE) +
      scale_colour_manual(values=c("blue", "grey"))
    

    enter image description here

    I hope that helps.