Tags: r, machine-learning, cross-validation, feature-selection, resampling

Feature Selection Using Bootstrap Resampling, LASSO and Stepwise Regression


In this paper, the authors perform radiomics feature selection for survival prediction by:

  1. Bootstrap resampling the dataset 1000 times
  2. Fitting a cross-validated LASSO model to each of the resampled datasets
  3. Retaining the 10 features that most often have non-zero coefficients across all 1000 models
  4. Fitting reverse stepwise regression with the ten selected features on the resampled datasets (the same datasets generated in step 1)
  5. Choosing the final features from the most common Cox regression model.

I would like to replicate this approach (albeit for logistic regression rather than Cox regression).

I am able to use the following R code, based on the 'boot' package, to obtain the LASSO coefficients for each bootstrap sample, from which the top K features can be tallied:

library(boot)
library(glmnet)

# Statistic for boot(): fit a cross-validated LASSO on one resample
# and return its coefficients (intercept first) at lambda.min.
lasso_Select <- function(x, indices) {
  x <- x[indices, ]
  y <- x$Outcome
  x2 <- as.matrix(subset(x, select = -Outcome))
  # cv.glmnet fits the full lambda path, so no separate glmnet() call is needed
  cv <- cv.glmnet(x2, y, family = "binomial", alpha = 1, standardize = TRUE)
  coef(cv, s = "lambda.min")[, 1]
}

myBootstrap <- boot(scaled_train, lasso_Select, R = 1000, parallel = "multicore", ncpus = 5)
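
For reference, the tally in step 3 might look something like the sketch below. It is only an illustration: coef_mat is a small hypothetical stand-in for myBootstrap$t (one row per bootstrap replicate, one column per coefficient), and top_k_features() is a helper name made up here, not part of any package.

```r
# Hypothetical stand-in for myBootstrap$t: one row per bootstrap replicate,
# one column per coefficient, intercept first.
coef_mat <- rbind(
  c(0.2, 0.5, 0.0, -0.3, 0.0),
  c(0.1, 0.4, 0.0,  0.0, 0.0),
  c(0.3, 0.6, 0.1, -0.2, 0.0)
)
colnames(coef_mat) <- c("(Intercept)", "f1", "f2", "f3", "f4")

# Count how often each feature has a non-zero coefficient and keep the k
# most frequent ones (ties are broken by column order).
top_k_features <- function(coef_mat, k) {
  keep <- colnames(coef_mat) != "(Intercept)"
  freq <- colSums(coef_mat[, keep, drop = FALSE] != 0)
  names(sort(freq, decreasing = TRUE))[seq_len(min(k, length(freq)))]
}

top_features <- top_k_features(coef_mat, k = 2)
```

With the real bootstrap output, the matrix myBootstrap$t has no column names, so they would first need to be copied over, for example with colnames(coef_mat) <- names(myBootstrap$t0).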

However, I don't believe I can access the individual resampled datasets to then run the multiple logistic regression models and choose the most common.

Any advice on how to approach this?


Solution

  • As the manual page for boot() explains:

    For most of the boot methods the resampling is done in the master process, but not if simple = TRUE nor sim = "parametric".

    As you are not doing parametric bootstrapping and you don't need to specify simple = TRUE, the code displayed when you type boot::boot at the R prompt shows how the resampled data indices are generated. The critical code is:

    if (!simple)
        i <- index.array(n, R, sim, strata, m, L, weights)

    where n is the number of data rows, R is the number of bootstrap samples, and the other arguments are defined in the call to boot() and don't seem to apply to your situation. Typing boot:::index.array shows the code for that function, which in turn calls boot:::ordinary.array for ordinary (non-parametric) resampling like yours. The result i is just an R-by-n matrix of row indices: row r lists which data rows make up bootstrap sample r.

    It should be reasonably straightforward to tweak the code for boot() to return that matrix of indices along with the other values the function normally returns.
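
A route that may avoid editing boot() at all, offered here as a sketch rather than as part of the original suggestion, is the package's boot.array() function. Called with indices = TRUE, it uses the random seed stored in the boot object to regenerate the index matrix. The toy data frame and mean_stat() below are hypothetical placeholders for scaled_train and lasso_Select(), and this assumes the default simple = FALSE.

```r
library(boot)

set.seed(42)
dat <- data.frame(x = rnorm(30))  # hypothetical toy data

# Placeholder statistic: the mean of x in the resample.
mean_stat <- function(d, indices) mean(d$x[indices])

b <- boot(dat, mean_stat, R = 50)

# R x n matrix: row r holds the row indices used for bootstrap sample r.
idx <- boot.array(b, indices = TRUE)

# Sanity check: recomputing the statistic from the recovered indices
# should reproduce the values stored in b$t.
recomputed <- apply(idx, 1, function(i) mean(dat$x[i]))
```

Each row idx[r, ] can then be used to rebuild the r-th resampled dataset as dat[idx[r, ], ] for the second-stage models.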

    An alternative might be to return indices directly in your lasso_Select() function, although I'm not sure how well the boot() function would handle that.
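
Yet another possibility is to skip boot() for the bookkeeping and generate the index matrix yourself, so that both stages demonstrably use the same resamples. The sketch below shows only the second stage (steps 4 and 5) with logistic regression. It assumes that "reverse stepwise" means AIC-based backward elimination via stats::step(), which may differ from the criterion used in the paper. The simulated data and the feature names f1 to f4 are hypothetical stand-ins for the training data restricted to the features retained in step 3.

```r
set.seed(1)

# Simulated stand-in for the training data; Outcome and f1..f4 are hypothetical.
n <- 200
dat <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n), f4 = rnorm(n))
dat$Outcome <- rbinom(n, 1, plogis(0.8 * dat$f1 - 0.8 * dat$f2))

# One row of row indices per bootstrap sample, generated once and reused
# by every stage of the procedure.
R <- 20
idx <- matrix(sample.int(n, n * R, replace = TRUE), nrow = R, ncol = n)

# Backward stepwise logistic regression on each stored resample; each fit
# is summarised by the sorted names of the terms it retains.
selected <- vapply(seq_len(R), function(b) {
  d <- dat[idx[b, ], ]
  full <- glm(Outcome ~ f1 + f2 + f3 + f4, family = binomial, data = d)
  reduced <- step(full, direction = "backward", trace = 0)
  paste(sort(attr(terms(reduced), "term.labels")), collapse = " + ")
}, character(1))

# Most frequently selected model across the resamples.
model_counts <- sort(table(selected), decreasing = TRUE)
most_common_model <- names(model_counts)[1]
```

The same idx matrix could drive the LASSO stage as well, by calling lasso_Select(scaled_train, idx[b, ]) for each b in place of the boot() call.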