mlr3

RFE Termination Using RMSE with AutoFSelector


To mimic how caret performs RFE and select features that produce the lowest RMSE, it was suggested to use the archive.

I am using AutoFSelector and nested resampling with the following code:


library(mlr3)
library(mlr3learners)  # provides regr.ranger
library(mlr3fselect)

ARMSS <- read.csv("Index ARMSS Proteomics Final.csv", row.names = 1)

set.seed(123, "L'Ecuyer")

task = as_task_regr(ARMSS, target = "Index.ARMSS")

# Impurity importance is what RFE uses to rank and drop features
learner = lrn("regr.ranger", importance = "impurity")

set_threads(learner, n = 8)

# Inner loop: scores each feature subset by cross-validated RMSE
resampling_inner = rsmp("cv", folds = 7)
measure = msr("regr.rmse")
terminator = trm("none")

at = AutoFSelector$new(
  learner = learner,
  resampling = resampling_inner,
  measure = measure,
  terminator = terminator,
  fselector = fs("rfe", n_features = 1, feature_fraction = 0.5, recursive = FALSE),
  store_models = TRUE)

# Outer loop: estimates the performance of the whole selection procedure
resampling_outer = rsmp("repeated_cv", folds = 10, repeats = 10)

rr = resample(task, at, resampling_outer, store_models = TRUE)

Should I use extract_inner_fselect_archives() to find, for each outer iteration, the smallest RMSE and the features that were selected, and then rerun the code above with the n_features argument changed? How do I reconcile differences across iterations in the number of features and/or the features selected?


Solution

  • "Nested resampling is a statistical procedure to estimate the predictive performance of the model trained on the full dataset, it is not a procedure to select optimal hyperparameters. Nested resampling produces many hyperparameter configurations which should not be used to construct a final model."

    (mlr3book, Chapter 4 - Optimization)

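    The same is true for feature selection. You don't select a feature set with nested resampling; you estimate the performance of the final model. The feature sets found in the individual outer iterations are therefore not meant to be reconciled or reused.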

    "it was suggested to use the archive"

    Without nested resampling, you just call instance$result (on a feature selection instance) or at$fselect_result (on a trained AutoFSelector) to get the feature subset with the lowest RMSE.
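A minimal sketch of that workflow, assuming the built-in mtcars task as a stand-in for the proteomics data and 3 folds instead of 7 to keep it quick (both are illustrative choices, not part of the question). The AutoFSelector is trained once on the full task to obtain the final feature subset; nested resampling is then used only to report an unbiased RMSE for that procedure.

```r
library(mlr3)
library(mlr3learners)  # provides regr.ranger (needs the ranger package)
library(mlr3fselect)
library(data.table)

set.seed(123)

# Stand-in task (assumption): replace with as_task_regr(ARMSS, target = "Index.ARMSS")
task = tsk("mtcars")

at = AutoFSelector$new(
  fselector = fs("rfe", n_features = 1, feature_fraction = 0.5),
  learner = lrn("regr.ranger", importance = "impurity"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("regr.rmse"),
  terminator = trm("none"),
  store_models = TRUE
)

# Run RFE once on the full data; the final model is refit on the best subset
at$train(task)

# Feature subset with the lowest cross-validated RMSE
best = at$fselect_result
print(best$features[[1]])

# All subset sizes RFE evaluated, best first
archive = as.data.table(at$archive)
print(archive[order(regr.rmse)])

# Nested resampling: only an estimate of how well the procedure above generalizes
rr = resample(task, at, rsmp("cv", folds = 3), store_models = TRUE)
print(rr$aggregate(msr("regr.rmse")))
```

The subsets returned by extract_inner_fselect_results(rr) will usually differ between outer folds. That is expected: they indicate how stable the selection is, while the subset to report is the one from at$fselect_result.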