I'm trying to write a function in R that calculates a central prediction and upper and lower prediction intervals from a trained caret model (i.e., a "train" object) using the 0.632+ Bootstrap approach.
In this effort, I'm attempting to follow a Python example (https://www.saattrupdan.com/posts/2020-03-01-bootstrap-prediction) as a guide. However, I'm having trouble replicating it in R. Any guidance would be appreciated.
My function is supposed to take a trained caret model (from which it extracts the training data) and new data as input, and return a central prediction plus lower and upper prediction limits. However, at present, my prediction interval values are not correct.
As highlighted in a comment by Mark Rieke, one issue is that the entire 0.632+ procedure (the no-information error rate, the relative overfitting rate, and the resulting weight) needs to be carried out within every bootstrap split, but my current code only computes these quantities once, after the loop.
Here's my current code:
library(caret)
# Set the random seed for reproducibility
set.seed(123)
# Generate data
n <- 100
explainer <- runif(n)
y <- 1 + 0.2 * explainer + rnorm(n)
data <- data.frame(explainer, y)
# Fit linear regression models
fit_simple <- lm(y ~ explainer) # A plain old linear model
fit_caret <- train(
  y = y,
  x = data.frame(explainer),
  method = "lm"
) # An identical model, but fit using caret
new_data <- data.frame(explainer = runif(15, min = -10, max = 10))
# Function to calculate prediction intervals using 0.632+ Bootstrap
calculate_prediction_intervals <- function(model, new_data, alpha = 0.05) {
  # Extract training data and outcomes from the model
  X_train <- base::subset(model$trainingData, select = -c(.outcome))
  y_train <- as.numeric(model$trainingData$.outcome)
  n <- nrow(X_train)
  nbootstraps <- as.integer(sqrt(n))
  # Initialize matrices to store bootstrap predictions and validation residuals
  bootstrap_preds <- matrix(0, nrow(new_data), nbootstraps)
  val_residuals <- matrix(0, n, nbootstraps)
  for (b in 1:nbootstraps) {
    train_idxs <- sample(1:n, n, replace = TRUE)
    val_idxs <- setdiff(1:n, train_idxs)
    # Fit a bootstrap sample of the model
    fit_b <- train(
      y = y_train[train_idxs],
      x = X_train[train_idxs, , drop = FALSE],
      method = model$method,
      tuneGrid = model$bestTune,
      trControl = trainControl(method = "none", savePredictions = FALSE)
    )
    # Compute validation set predictions and residuals
    preds_val <- predict(fit_b, newdata = X_train[val_idxs, , drop = FALSE])
    val_residuals[val_idxs, b] <- y_train[val_idxs] - preds_val
    # Compute bootstrap predictions on new data
    preds_new <- predict(fit_b, newdata = new_data)
    bootstrap_preds[, b] <- preds_new
  }
  # Center the bootstrap predictions and residuals
  bootstrap_preds <- bootstrap_preds - colMeans(bootstrap_preds)
  val_residuals <- val_residuals - colMeans(val_residuals)
  # Fit the original model to the full training data
  fit <- train(
    y = y_train,
    x = X_train,
    method = model$method,
    tuneGrid = model$bestTune,
    trControl = trainControl(method = "none", savePredictions = FALSE)
  )
  preds <- predict(fit, newdata = X_train)
  train_residuals <- y_train - preds
  # Calculate various values needed for 0.632+ Bootstrap
  no_information_error <- mean(abs(sample(y_train) - sample(preds)))
  generalization <- abs(colMeans(val_residuals) - mean(train_residuals))
  no_information_val <- abs(no_information_error - train_residuals)
  relative_overfitting_rate <- mean(generalization / no_information_val)
  weight <- 0.632 / (1 - 0.368 * relative_overfitting_rate)
  # Calculate prediction residuals
  residuals <- (1 - weight) * train_residuals + weight * colMeans(val_residuals)
  # Calculate prediction percentiles
  percentiles <- apply(bootstrap_preds, 1, function(x) {
    quantile(x + residuals, probs = c(alpha / 2, 1 - alpha / 2))
  })
  # Create a data frame with predictions, lower, and upper limits
  result <- data.frame(
    fit = predict(fit, newdata = new_data),
    lwr = percentiles[1, ],
    upr = percentiles[2, ]
  )
  return(result)
}
My code fails to even approximately reproduce the expected prediction intervals for a linear model, and increasing the number of bootstrap resamples doesn't help. Can you help me find where I went wrong?
> calculate_prediction_intervals(fit_caret, new_data)
fit lwr upr
1 1.18302967 -0.2597420 1.1699486
2 2.07894173 -1.4669930 7.0949444
3 0.71611677 -2.1804343 0.4431974
4 1.37767478 -0.6438284 2.5235400
5 1.68312227 -0.9393278 4.4294951
6 1.71845385 -1.0413210 4.8058089
7 0.06639059 -6.7192473 1.1929259
8 0.58836348 -3.2036975 0.7598031
9 1.55414870 -0.7131324 3.5583779
10 0.04536204 -6.8536552 1.2401264
11 1.76387322 -1.0177667 5.0307556
12 -0.01836307 -7.4146538 1.4246235
13 1.29583653 -0.4646119 2.0345750
14 0.18768121 -5.8312821 1.0571434
15 1.33552830 -0.4831878 2.0921489
> predict(fit_simple, newdata = new_data, interval= "prediction")
fit lwr upr
1 1.18302967 -0.9262779 3.292337
2 2.07894173 -4.5686088 8.726492
3 0.71611677 -2.0877607 3.519994
4 1.37767478 -1.4345098 4.189859
5 1.68312227 -2.6904110 6.056656
6 1.71845385 -2.8512314 6.288139
7 0.06639059 -6.2672902 6.400071
8 0.58836348 -2.8285939 4.005321
9 1.55414870 -2.1238365 5.232134
10 0.04536204 -6.4117391 6.502463
11 1.76387322 -3.0606644 6.588411
12 -0.01836307 -6.8508475 6.814121
13 1.29583653 -1.1747848 3.766458
14 0.18768121 -5.4394392 5.814802
15 1.33552830 -1.2942424 3.965299
I am aware that alternatives to the method I am trying to replicate exist, e.g., conformal inference, or simply adding the raw residuals to the predictions, but I'm after this specific application here. The approach should generally replicate the methods of https://arxiv.org/abs/2201.11676, similar to other approaches that have used tidymodels, e.g., https://www.bryanshalloway.com/2021/04/05/simulating-prediction-intervals/ and the workboots package (https://markjrieke.github.io/workboots/).
I plan to use this function on more complicated caret models (e.g., with many predictors, not just linear models) trained with the x and y data specified; I'm not using the formula method in caret (see the sketch below for the distinction). Because of this complexity, approaches that only work for linear models won't do the trick, either.
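To make that concrete, here's the distinction on the toy data from above (purely illustrative; fit_xy and fit_formula are throwaway names, and both calls fit the same model):
# The x/y interface, which my models use:
fit_xy <- train(x = data.frame(explainer), y = y, method = "lm")
# The formula interface, which my models do NOT use:
fit_formula <- train(y ~ explainer, data = data, method = "lm")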
Following the approach from the workboots package, with only a few adjustments to make it work with caret objects, we can get all the bootstrapped predictions (with the corrected residuals added), the quantiles of those predictions for a given alpha, and the central fit on the new data using the following code.
Note: This is slightly different from the original Python effort in formulation, though it's the same in effect.
# Function to generate prediction intervals for a caret model using bootstrapping
predict_caret_boots <- function(model, n = 2000, alpha = 0.05, new_data) {
  # Extract training data and outcomes from the model
  X_train <- base::subset(model$trainingData, select = -c(.outcome))
  y_train <- as.numeric(model$trainingData$.outcome)
  # Initialize a list to store predictions
  preds_list <- list()
  # Loop through n bootstrap resamples
  for (i in 1:n) {
    # Create a bootstrap sample; rows not drawn form the out-of-bag (OOB) set
    train_idxs <- sample(length(y_train), replace = TRUE)
    boot_X_train <- X_train[train_idxs, , drop = FALSE]
    boot_y_train <- y_train[train_idxs]
    boot_X_oob <- X_train[-train_idxs, , drop = FALSE]
    boot_y_oob <- y_train[-train_idxs]
    # Fit a model on the bootstrap sample
    fit_b <- train(
      y = boot_y_train,
      x = boot_X_train,
      method = model$method,
      tuneGrid = model$bestTune,
      trControl = trainControl(method = "none", savePredictions = FALSE)
    )
    # Make predictions on the new data
    preds <- predict(fit_b, newdata = new_data)
    # Make predictions on the training data
    preds_train <- predict(fit_b, newdata = boot_X_train)
    # Make predictions on the OOB data
    preds_oob <- predict(fit_b, newdata = boot_X_oob)
    # Calculate centered training residuals
    resids_train <- boot_y_train - preds_train
    resids_train <- resids_train - mean(resids_train)
    # Calculate centered OOB residuals
    resids_oob <- boot_y_oob - preds_oob
    resids_oob <- resids_oob - mean(resids_oob)
    # Calculate the no-information error rate (rmse_ni) with RMSE as the loss function
    combos <- tidyr::crossing(boot_y_train, preds_train)
    rmse_ni <- caret::RMSE(combos$preds_train, combos$boot_y_train)
    # Calculate the overfit rate
    rmse_oob <- caret::RMSE(boot_y_oob, preds_oob)
    rmse_train <- caret::RMSE(boot_y_train, preds_train)
    overfit <- (rmse_oob - rmse_train) / (rmse_ni - rmse_train)
    # Calculate the weight (if overfit = 0, the weight is just .632).
    # Use the actual proportion of distinct training/OOB samples,
    # rather than the average of 0.632/0.368
    prop_368 <- length(boot_y_oob) / length(boot_y_train)
    prop_632 <- 1 - prop_368
    weight <- prop_632 / (1 - (prop_368 * overfit))
    # Determine the residual standard deviation as a weighted blend
    sd_oob <- stats::sd(resids_oob)
    sd_train <- stats::sd(resids_train)
    sd_resid <- weight * sd_oob + (1 - weight) * sd_train
    # Add simulated residuals to the predictions
    preds <- preds + stats::rnorm(length(preds), 0, sd_resid)
    # Create a data frame with predictions and add it to the list
    preds_df <- data.frame(fit = preds)
    preds_list[[i]] <- preds_df
  }
  # Calculate quantiles across resamples for each row of new_data
  preds_list <- data.frame(preds_list)
  quantiles <- apply(preds_list, 1, function(row) {
    quantile(row, probs = c(alpha / 2, 1 - alpha / 2))
  })
  # Get the central fit, too
  fit_new <- predict(model, new_data)
  result <- list(
    preds = data.frame(preds_list),
    quantiles = t(data.frame(quantiles)),
    fit = data.frame(fit_new)
  )
  return(result)
}
A little adjustment to this function could help it explicitly handle preprocessing options from caret, etc. But for now, this appears to do the trick beautifully!
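As a quick sanity check, here's one way to call the function on the toy data from the question and compare it against the parametric lm intervals (illustrative only; fit_caret, new_data, and fit_simple are the objects defined above, and the exact numbers will vary with the random draws):
# Bootstrapped intervals from the function above
set.seed(123)
pi_boot <- predict_caret_boots(fit_caret, n = 2000, alpha = 0.05, new_data = new_data)
# Combine the central fit with the bootstrap quantiles
cbind(fit = pi_boot$fit$fit_new, pi_boot$quantiles)
# For comparison, the parametric intervals from the plain lm fit
predict(fit_simple, newdata = new_data, interval = "prediction")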