Tags: r, machine-learning, p-value, mlr, significance

p-values for optimization on validation set MLR


I have optimized some algorithms (in mlr3) on a validation set:

I have extracted the balanced accuracy of each algorithm, but I would like to know whether it is possible to get a p-value for each prediction versus a random prediction (a featureless learner). With a Friedman test I can compute this via benchmarking, but only on my test set. Here is my code:

#Auto tuning Ranger
learner_ranger = lrn("classif.ranger", predict_type = "prob",
  num.trees = to_tune(1, 2000),
  mtry.ratio = to_tune(0, 1),
  sample.fraction = to_tune(1e-1, 1),
  importance = "impurity")

set.seed(1234)
at_ranger = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_ranger,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

set.seed(1234)
#Auto tuning knn
learner_kknn = lrn("classif.kknn", predict_type = "prob", k = to_tune(1, 30))

at_kknn = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_kknn,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

set.seed(1234)
#Auto tuning xgboost
learner_xgboost = lrn("classif.xgboost", predict_type = "prob",
  nrounds = to_tune(1, 5000),
  eta = to_tune(1e-4, 1, logscale = TRUE),
  subsample = to_tune(0.1, 1),
  max_depth = to_tune(1, 15),
  min_child_weight = to_tune(0, 7),
  colsample_bytree = to_tune(0, 1),
  colsample_bylevel = to_tune(0, 1),
  lambda = to_tune(1e-3, 1e3, logscale = TRUE),
  alpha = to_tune(1e-3, 1e3, logscale = TRUE))

at_xgboost = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_xgboost,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

set.seed(1234)
#Auto tuning rpart
learner_rpart = lrn("classif.rpart", predict_type = "prob",
  cp = to_tune(0.0001, 1),
  minsplit = to_tune(1, 60),
  maxdepth = to_tune(1, 30),
  minbucket = to_tune(1, 60))

at_rpart = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_rpart,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

set.seed(1234)
#Auto tuning svm
learner_svm = lrn("classif.svm", type = "C-classification", predict_type = "prob",
  cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE)),
  gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE)),
  kernel = to_tune(c("polynomial", "radial")),
  degree = to_tune(1, 4))

at_svm = auto_tuner(
  tuner = tnr("random_search"),
  learner = learner_svm,
  resampling = resampling_inner,
  measure = msr("classif.bacc"),
  term_evals = 20,
  store_tuning_instance = TRUE,
  store_models = TRUE
)

For example, with XGBoost:

learners = c(lrn("classif.featureless"), at_xgboost)

set.seed(1234)
design = benchmark_grid(
  tasks = list(task, task_imp, task_Emb, task_top, task_Pearson, task_IG), 
  learners = learners,
  resamplings = resampling_outer)
bmr = benchmark(design, store_models = TRUE)

results_xgboost <- bmr$aggregate(measures)
print(results_xgboost)
 
archives = extract_inner_tuning_archives(bmr)
inner_learners = map(archives$resample_result, "learners")

best_bacc_xgboost <- aggregate(classif.bacc ~ task_id, data = archives, max)

I have extracted my best balanced accuracy here, but I would like to know, in comparison with my feature selection technique, whether there is a significant difference or whether it is just chance.

I found the beginning of an answer in this topic: https://stats.stackexchange.com/questions/368176/how-to-determine-whether-a-classifier-is-significantly-better-than-random-guessi

"To assess whether one classifier is better than a "benchmark" one (e.g., one that always assigns a probability of 1 to the majority class and a probability of 0 to all others), you need to derive the distribution of the score of the benchmark. You can get this by bootstrapping: resample the test set, apply the benchmark to the resampled set, note the score. Repeat this many times. You now have the distribution you are looking for. Check how many of these resampled scores are smaller (since scores are better when smaller) than the score of your candidate classifier."

The big problem is that, to create my test set, I manually customized the partition into training, validation and test sets...

Because of this I will quickly be limited in terms of resampling. My raw data are quite particular: I have patients, and each patient has several lesions, so the data must be split in a way that keeps all lesions belonging to the same patient together.
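As a side note, if the patient identifier is available as a column in the data, mlr3 can enforce this constraint automatically through the "group" column role, so that standard resampling strategies keep all lesions of a patient in the same set. A minimal sketch, assuming a hypothetical patient_id column:

task = as_task_classif(data, target = "LesionResponse")

# Observations sharing the same patient_id are never split across
# train and test sets by the resampling.
task$set_col_roles("patient_id", roles = "group")

# Any grouping-aware resampling now respects the constraint, e.g. cross-validation
resampling_cv = rsmp("cv", folds = 5)
resampling_cv$instantiate(task)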

EDIT: Based on Lars's answer, here is the split flow of my work: [figure: data splitting scheme]

I have created three sets (train, validation, test) by splitting the data with a custom resampling:

task = as_task_classif(data, target = "LesionResponse")

#Creation of the OUTER resampling via customization
resampling_outer = rsmp("custom")
resampling_outer$instantiate(task, train = list(train_rows_outer), test = list(test_rows))

#Creation of the INNER resampling via customization
resampling_inner = rsmp("custom")
resampling_inner$instantiate(task, train = list(train_rows), test = list(valid_rows))

# train_rows and valid_rows split my training set in two for the optimization of the models (inner resampling)

# train_rows_outer and test_rows split my data set in two: the outer training set and the test set

Edit: Here is my strategy after benchmarking multiple learners. I obtained these results on the validation set.

The best results were obtained with XGBoost (no surprise there), but "task_importance" was subject to overfitting, as you can see. So I checked and plotted the results on the test set, which confirmed that this feature selection technique was overfitting on the validation set. [figure: validation vs. test results]

library(pROC)      # roc(), roc.test(), ci.auc()
library(ggplot2)   # ROC plot

resample_result_importance <- bmr$resample_results$resample_result[[6]]
resample_result_embarquee <- bmr$resample_results$resample_result[[4]]

prediction_xgboost_importance <- resample_result_importance$prediction()
prediction_xgboost_embarquee <- resample_result_embarquee$prediction()

pred_importance<-as.data.table(prediction_xgboost_importance)
pred_embarquee<-as.data.table(prediction_xgboost_embarquee)

levels(pred_importance$response) <- levels(pred_importance$truth)
levels(pred_embarquee$response) <- levels(pred_embarquee$truth)

# calculate confusion matrices
cm_importance <- mlr3measures::confusion_matrix(truth = pred_importance$truth, response = pred_importance$response, positive = "0")
cm_embarquee <- mlr3measures::confusion_matrix(truth = pred_embarquee$truth, response = pred_embarquee$response, positive = "0")

# print confusion matrices
print(cm_importance)
print(cm_embarquee)
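As an aside, if the question is whether the two feature-selection strategies really differ on the test set (assuming both prediction tables refer to the same test observations in the same order), a McNemar test on the paired correct/incorrect indicators is one common option; this is only a sketch:

# Paired comparison of the two models on the same test rows
correct_importance <- pred_importance$response == pred_importance$truth
correct_embarquee <- pred_embarquee$response == pred_embarquee$truth
mcnemar.test(table(correct_importance, correct_embarquee))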

# calculate AUC
auc_score_importance <- mlr3measures::auc(truth = pred_importance$truth, prob = pred_importance$`prob.0`, positive = "0")
auc_score_embarquee <- mlr3measures::auc(truth = pred_embarquee$truth, prob = pred_embarquee$`prob.0`, positive = "0")

# print AUC
print(auc_score_importance)
print(auc_score_embarquee)

# calculate p-values and confidence intervals
roc_obj_importance <- roc(pred_importance$truth, pred_importance$prob.1)
roc_obj_embarquee <- roc(pred_embarquee$truth, pred_embarquee$prob.1)

random_preds_importance <- rep(1, length(pred_importance$truth))
random_preds_embarquee <- rep(1, length(pred_embarquee$truth))

roc_obj_random_importance <- roc(pred_importance$truth, random_preds_importance)
roc_obj_random_embarquee <- roc(pred_embarquee$truth, random_preds_embarquee)

roc_test_result_importance <- roc.test(roc_obj_importance, roc_obj_random_importance)
roc_test_result_embarquee <- roc.test(roc_obj_embarquee, roc_obj_random_embarquee)

p_value_importance <- roc_test_result_importance$p.value
p_value_embarquee <- roc_test_result_embarquee$p.value

auc_ci_importance <- ci.auc(roc_obj_importance, method = "delong")
auc_ci_embarquee <- ci.auc(roc_obj_embarquee, method = "delong")

print(p_value_importance)
print(p_value_embarquee)

print(auc_ci_importance)
print(auc_ci_embarquee)
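Note that a constant "random" prediction has a degenerate ROC curve, so the roc.test() calls against the random predictors above can be fragile. If the real question is whether the two feature-selection strategies differ from each other, a paired DeLong test comparing the two correlated ROC curves directly may be more informative (this assumes both were evaluated on the same test observations in the same order):

# Both ROC objects come from the same test predictions, so the curves are paired
roc_test_models <- roc.test(roc_obj_importance, roc_obj_embarquee, paired = TRUE)
print(roc_test_models$p.value)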

rocvals_importance <- data.frame(
  threshold = unique(pred_importance$`prob.1`),
  TPR = sapply(unique(pred_importance$`prob.1`), function(x) sum(pred_importance$truth[pred_importance$`prob.1` >= x] == "1") / sum(pred_importance$truth == "1")),
  FPR = sapply(unique(pred_importance$`prob.1`), function(x) sum(pred_importance$truth[pred_importance$`prob.1` >= x] == "0") / sum(pred_importance$truth == "0")),
  model = "importance"
)

rocvals_embarquee <- data.frame(
  threshold = unique(pred_embarquee$`prob.1`),
  TPR = sapply(unique(pred_embarquee$`prob.1`), function(x) sum(pred_embarquee$truth[pred_embarquee$`prob.1` >= x] == "1") / sum(pred_embarquee$truth == "1")),
  FPR = sapply(unique(pred_embarquee$`prob.1`), function(x) sum(pred_embarquee$truth[pred_embarquee$`prob.1` >= x] == "0") / sum(pred_embarquee$truth == "0")),
  model = "embarquee"
)

rocvals <- rbind(rocvals_importance, rocvals_embarquee)

caption_text <- paste(
  "Model importance: AUC = ", round(auc_score_importance, 3), 
  ", CI = [", round(auc_ci_importance[1], 3), ", ", round(auc_ci_importance[2], 3), "], p-value = ", round(p_value_importance, 3),
  "\nModel embarquee: AUC = ", round(auc_score_embarquee, 3),
  ", CI = [", round(auc_ci_embarquee[1], 3), ", ", round(auc_ci_embarquee[2], 3), "], p-value = ", round(p_value_embarquee, 3)
)

auc_labels <- data.frame(
  model = c("embarquee", "importance"),
  AUC = c(auc_score_embarquee, auc_score_importance),
  x = c(0.7, 0.7),
  y = c(0.5, 0.4)
)

# plot ROC curves
p <- ggplot(rocvals, aes(FPR, TPR, color = model)) +
  geom_point() +
  geom_line() +
  geom_abline(linetype = "dashed", color = "red") +
  geom_text(data = auc_labels, aes(x = x, y = y, label = paste("AUC =", round(AUC, 3)), color = model), hjust = 1, size = 4) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1)) +
  labs(
    title = "ROC Curves for XGBoost on Test Data",
    x = "1 - Specificity (FPR)",
    y = "Sensitivity (TPR)",
    color = "Model",
    caption = caption_text  # add the caption
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold", hjust = 0.5, color = "black"),
    axis.title.x = element_text(size = 12, face = "bold", color = "black"),
    axis.title.y = element_text(size = 12, face = "bold", color = "black"),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_line(color = "grey"),
    plot.caption = element_text(hjust = 0, face= "italic")  # style for the caption
  )

print(p)

Solution

  • mlr3 supports statistical tests that can be run directly on benchmark results, see the relevant section in the mlr3 book. This will also automatically correct for multiple testing, which doesn't happen if you extract p-values manually.

    To get the bootstrapping estimate you describe, all you need to do is set the resampling method of the benchmark design to bootstrapping with the desired number of iterations (a rough sketch of both points follows below).
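A rough sketch of both points, assuming the objects defined in the question (the tasks and at_xgboost) and the mlr3benchmark package for the tests; the number of bootstrap repeats is only illustrative, and the patient grouping could be handled with the "group" column role mentioned above:

library(mlr3benchmark)

# Bootstrap as the (outer) resampling of the benchmark design
design_boot = benchmark_grid(
  tasks = list(task, task_imp, task_Emb, task_top, task_Pearson, task_IG),
  learners = list(lrn("classif.featureless"), at_xgboost),
  resamplings = rsmp("bootstrap", repeats = 100)
)
bmr_boot = benchmark(design_boot)

# Friedman test across tasks and learners, with corrected post-hoc comparisons
# (the post-hoc step is mainly useful when benchmarking more than two learners)
bma = as_benchmark_aggr(bmr_boot, measures = msr("classif.bacc"))
bma$friedman_test()
bma$friedman_posthoc()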