mlr3

I'm trying to emulate a cv.glmnet (family = "cox") call for a model with splines using mlr3


The following code throws an error. Thank you in advance for your help.

require(mlr3)
require(mlr3proba)
require(mlr3learners)
require(mlr3tuning)
require(mlr3pipelines)
require(mlr3verse)
require(mlr3viz)
#- require(mlr3fda)
require(survival)
require(glmnet)
require(splines)

Simulate a survival dataset


set.seed(123)
n <- 100
p <- 3
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
time <- rexp(n, rate = 1)
status <- sample(0:1, n, replace = TRUE)
df <- as.data.frame(X)
df$time <- time
df$status <- status

Create a survival task

task <- TaskSurv$new("survival_task", backend = df, time = "time", event ="status")
task

#---- Perform initial split
initial_split <- rsmp("holdout")
initial_split$instantiate(task)

Separate the data into training and testing sets


train_task <- task$clone()$filter(initial_split$train_set(1))  
test_task  <- task$clone()$filter(initial_split$test_set(1))

Load the glmnet learner


learner <- lrn("surv.glmnet")
#---- Define the hyperparameter search space  
search_space <- ps(   alpha  = p_dbl(lower = 0, upper = 1),   
                      lambda = p_dbl(lower = 0.0001, upper = 0.1, logscale = TRUE)
 )

#---- Define objects needed for tuning 
#---- Create a pipeline for the splines transformation
#- library(paradox)
#- Define a function to apply splines transformation
apply_splines <- function(x) {
     as.data.table(splines::ns(x, df = 3))  
}  

#- Define the pipeline graph for applying splines transformation

graph <- gunion(list(
      po("colapply", id = "spline_V1", applicator = apply_splines,        
              affect_columns = selector_name("V1")),
      po("colapply", id = "spline_V2", applicator = apply_splines,       
              affect_columns = selector_name("V2")),
      po("colapply", id = "spline_V3", applicator = apply_splines,
              affect_columns = selector_name("V3"))  )) %>>% 
      po("featureunion") %>>% 
    learner  

#-- Create the pipeline learner 

    pipeline <- GraphLearner$new(graph)

    #--- Define the resampling strategy for tuning
    resampling <- rsmp("cv", folds = 5)
    
    # Define the performance measure for survival analysis
    measure <- msr("surv.cindex")
    
    # Create the tuner
    tuner <- tnr("grid_search", resolution = 5)
    #-- Define the AutoTuner
    at <- AutoTuner$new(
      learner = pipeline,
      resampling = resampling,
      measure = measure,
      search_space = search_space,
      terminator = trm("evals", n_evals = 20),
      tuner = tuner
    )
    
    # Train the AutoTuner on the training set
    at$train(train_task)

... part of the output is omitted

INFO  [18:08:27.654] [mlr3] Finished benchmark
INFO  [18:08:27.692] [bbotk] Result of batch 20:
INFO  [18:08:27.694] [bbotk]  alpha    lambda surv.cindex warnings errors runtime_learners
INFO  [18:08:27.694] [bbotk]   0.25 -7.483402   0.4561424        0      0             1.52
INFO  [18:08:27.694] [bbotk]                                 uhash
INFO  [18:08:27.694] [bbotk]  a491c12c-47e5-448b-b365-34aa53350e01
INFO  [18:08:27.711] [bbotk] Finished optimizing after 20 evaluation(s)
INFO  [18:08:27.712] [bbotk] Result:
INFO  [18:08:27.714] [bbotk]  alpha    lambda learner_param_vals  x_domain surv.cindex
INFO  [18:08:27.714] [bbotk]  <num>     <num>             <list>    <list>       <num>
INFO  [18:08:27.714] [bbotk]   0.75 -4.029524          <list[8]> <list[2]>   0.4561424
Error in self$assert(xs, sanitize = TRUE) : 
  Assertion on 'xs' failed: Parameter 'alpha' not available. Did you mean 'spline_V1.applicator' / 'spline_V1.affect_columns' / 'spline_V2.applicator'?.

Solution

  • The issue explained

    You define a GraphLearner that contains a learner somewhere inside it. When you define the AutoTuner, you provide the search_space of the standalone learner, not of the learner as it appears inside the larger GraphLearner.

    The difference is that for the standalone learner the parameters to tune are named alpha and lambda, whereas inside the GraphLearner they are named surv.glmnet.alpha and surv.glmnet.lambda. This triggers warnings because many lambdas are actually fitted (I think the search_space is effectively not used at all in your case). You can check this by passing just the learner to your AutoTuner: things then work normally.

    This holds in general: the GraphLearner names each parameter <pipeop_id>.<param_name> so that the parameters of the different PipeOps can be told apart.
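
The prefixing can be inspected directly. The following is only a minimal sketch: it swaps in `classif.rpart` and `po("scale")` as stand-ins (assumptions, chosen because they ship with mlr3 and mlr3pipelines) rather than the survival learner and spline PipeOps from the question.

```r
library(mlr3)
library(mlr3pipelines)

# Stand-in learner (assumption): classif.rpart instead of surv.glmnet
learner <- lrn("classif.rpart")

# Wrap the learner in a small graph and turn it into a GraphLearner
grlrn <- as_learner(po("scale") %>>% learner)

# Parameter ids of the standalone learner carry no prefix
learner_ids <- learner$param_set$ids()

# Parameter ids of the GraphLearner are prefixed with the PipeOp id
graph_ids <- grlrn$param_set$ids()

print(head(learner_ids))
print(head(graph_ids))
```

Printing `graph_ids` should list entries of the form `scale.<param_name>` and `classif.rpart.<param_name>`, which is the naming the error message in the question hints at.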

    Solution(s)

    1. Suggested: Define the search_space with the learner (and when the GraphLearner gets constructed, the prefix of the parameters is automatically added)
    learner = lrn("surv.glmnet")
    learner$param_set$set_values(.values = list(
      alpha = to_tune(0, 1),
      lambda = to_tune(p_dbl(0.001, 0.1, logscale = TRUE))
    ))
    

    Note that in this case you DO NOT need to use the search_space argument in AutoTuner.

    2. Manually define the search space with the prefixes included, provided that you don't change the id = surv.glmnet of the learner, i.e.:
    search_space = ps(
      surv.glmnet.alpha  = p_dbl(lower = 0, upper = 1),   
      surv.glmnet.lambda = p_dbl(lower = 0.0001, upper = 0.1, logscale = TRUE)
    )
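
Both routes should lead to the same parameter ids in the search space. The sketch below illustrates this under the assumption that `classif.rpart` and its `cp` parameter stand in for `surv.glmnet` and `alpha`/`lambda`; the bounds are arbitrary illustration values.

```r
library(mlr3)
library(mlr3pipelines)
library(paradox)

# Route 1: mark the parameter for tuning on the learner itself
learner <- lrn("classif.rpart")
learner$param_set$set_values(cp = to_tune(1e-4, 0.1, logscale = TRUE))
grlrn <- as_learner(po("scale") %>>% learner)

# The search space derived from the GraphLearner already carries the prefix
ids_from_tokens <- grlrn$param_set$search_space()$ids()

# Route 2: write the prefixed id by hand
manual_space <- ps(
  classif.rpart.cp = p_dbl(lower = 1e-4, upper = 0.1, logscale = TRUE)
)
ids_manual <- manual_space$ids()

print(ids_from_tokens)
print(ids_manual)
```

Route 1 keeps working if the learner id is changed later, whereas the hand-written ids in route 2 would have to be updated to match.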
    

    Suggestions

    # simple train/test split
    part = partition(task)

    # auto_tuner() is a shorter way to construct the AutoTuner
    at = auto_tuner(
      learner = pipeline, # better name => grlrn, it has the `learner` inside with "solution No 1" above
      resampling = resampling,
      measure = measure,
      tuner = tuner,
      term_evals = 20
    )

    # train on the training rows only
    at$train(task, row_ids = part$train)
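
Putting the pieces together, a small end-to-end sketch might look as follows. It is not the survival pipeline from the question: the `sonar` task, `classif.rpart`, `po("scale")`, random search, and the small evaluation budget are all assumptions made to keep the example self-contained and quick to run.

```r
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)
library(paradox)

set.seed(1)

# Stand-in task and learner (assumptions, not the survival setup above)
task <- tsk("sonar")
part <- partition(task)

learner <- lrn("classif.rpart")
learner$param_set$set_values(cp = to_tune(1e-4, 0.1, logscale = TRUE))

# GraphLearner wrapping the learner; the tune token travels with it
grlrn <- as_learner(po("scale") %>>% learner)

# No search_space argument: it is derived from the to_tune() token
at <- auto_tuner(
  tuner = tnr("random_search"),
  learner = grlrn,
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.ce"),
  term_evals = 5
)

# Tune and fit on the training rows, then predict on the held-out rows
at$train(task, row_ids = part$train)
pred <- at$predict(task, row_ids = part$test)

print(at$tuning_result)
print(pred$score(msr("classif.ce")))
```

The tuning result should report the tuned parameter under its prefixed name, `classif.rpart.cp`, confirming that the prefix was added automatically.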