r, benchmarking, feature-selection, lasso-regression, reproducible-research

How to rewrite my iterative LASSO-via-lars() code, run on N datasets, to use cv.lars() instead so I don't have to hardcode the lambda value


Here is a link to the GitHub Repository for this project.

I wrote R scripts that iteratively estimate N LASSO regressions, one on each of N randomly generated synthetic datasets stored in a file folder. LASSO is one of the 3 benchmark variable selection algorithms the principal researcher and I agreed upon for a statistical learning paper exploring the properties and characteristics of a new variable selection algorithm he is proposing, and I am running the 3 benchmark methods in R on the large set of N datasets he created. Here are the lines of code I originally wrote to estimate/fit those N LASSOs (which does in fact work), from the LASSO using Lars (regression part only).R file in the Stage 2 scripts folder of the GitHub Repo:

# This function fits all N LASSO regressions for/on
# each of the corresponding N datasets stored in the object
# of that name, then outputs the standard regression results
# typically returned for any regression run in R.
library(lars)    # for lars()
library(dplyr)   # for select() and starts_with()

set.seed(11)     # to ensure replicability
system.time(LASSO.Lars.fits <- lapply(X = datasets, function(i) 
  lars(x = as.matrix(select(i, starts_with("X"))), 
       y = i$Y, type = "lasso", normalize = FALSE)))

# This stores and prints out all of the regression 
# equation specifications selected by LASSO when called.
# With type = "coefficients", predict() needs no new data;
# with mode = "fraction", s = 0.1 fixes the coefficients at
# 10% of the maximal L1 norm along the LASSO path.
set.seed(11)     # to ensure replicability
system.time(LASSO.Lars.Coeffs <- lapply(LASSO.Lars.fits, 
                            function(i) predict(i, 
                                                s = 0.1, mode = "fraction", 
                                                type = "coefficients")[["coefficients"]]))

# Extract the names of the variables LASSO kept (nonzero
# coefficients, including negative ones) and dropped.
IVs.Selected.by.Lars <- lapply(LASSO.Lars.Coeffs, function(i) names(i[i != 0]))
IVs.Not.Selected.by.Lars <- lapply(LASSO.Lars.Coeffs, function(j) names(j[j == 0]))
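
A quick way to spot-check what these objects hold for any single dataset (the index 1 here is arbitrary):

# Named coefficient vector for the first dataset at s = 0.1
LASSO.Lars.Coeffs[[1]]

# Variables LASSO kept and dropped for the first dataset
IVs.Selected.by.Lars[[1]]
IVs.Not.Selected.by.Lars[[1]]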

The problem is that setting the s argument of predict() to 0.1 in LASSO.Lars.Coeffs is arbitrary. Thus, I need to rewrite this using cv.lars and extract the variables selected via k-fold cross-validation in each dataset. I already spent hours trying to do just that myself (using GPT-4 for ideas) last Wednesday and Thursday; here is the last approach I tried before moving on to other things (it comes from the LASSO using cv.lars (regression part only).R script in the Stage 2 scripts folder):

# Fit the models using cv.lars() instead of lars()
set.seed(11)  # to ensure replicability
system.time(LASSO.Lars.fits <- lapply(X = datasets, function(i) 
  cv.lars(x = as.matrix(select(i, starts_with("X"))), 
          y = i$Y, type = "lasso", normalize =  FALSE, trace = TRUE)))
print(LASSO.Lars.fits)

LASSO.Lars.Coeffs <- lapply(LASSO.Lars.fits, 
                            function(i) i$beta[, which.min(i$cv.error)])

# Extract the names of variables with non-zero coefficients
IVs.Selected.by.Lars <- lapply(LASSO.Lars.Coeffs, function(i) names(i[i != 0]))

But, unfortunately, this just returns a list of N NULLs rather than a list of N sets of selected variables, as my original version with the arbitrary choice of s = 0.1 does. What have I done wrong here, and is there a simple way to fix it?

p.s. Here is what the outputs stored in LASSO.Lars.Coeffs look like for all estimated models:

[[671]]
NULL

[[672]]
NULL

[[673]]
NULL

(… and so on: every one of the N elements is NULL.)

Solution

  • It turns out this is not possible to do with cv.lars(): the object cv.lars() returns only holds the cross-validation results (index, cv, cv.error, mode), not the fitted coefficient paths, which is why indexing into $beta yields NULL. Extracting the selected variables at a cross-validated lambda is only possible using the glmnet library via cv.glmnet, which I have already done. That is probably why no answers have been attempted. I am going to leave this question up instead of deleting it, though, because I wasted 3 weeks trying and would like to spare anyone else in the future the trouble of doing so.
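
For anyone who lands here, below is a minimal sketch of what the cv.glmnet route can look like. It assumes the same datasets list and X/Y column naming as above; it is my reconstruction rather than the exact script in the repository, and the choice of lambda.min over lambda.1se is a judgment call:

library(glmnet)   # for cv.glmnet()
library(dplyr)    # for select() and starts_with()

# Fit a cross-validated LASSO (alpha = 1) on each dataset;
# standardize = FALSE mirrors normalize = FALSE in the lars() calls above.
set.seed(11)      # to ensure replicability
LASSO.cv.glmnet.fits <- lapply(datasets, function(i) 
  cv.glmnet(x = as.matrix(select(i, starts_with("X"))), 
            y = i$Y, alpha = 1, standardize = FALSE))

# Extract the names of the variables with nonzero coefficients at the
# cross-validated lambda (lambda.min is the value minimizing CV error).
IVs.Selected.by.glmnet <- lapply(LASSO.cv.glmnet.fits, function(fit) {
  cf <- coef(fit, s = "lambda.min")             # sparse (p + 1) x 1 matrix
  nonzero <- rownames(cf)[as.vector(cf) != 0]   # names of rows kept by LASSO
  setdiff(nonzero, "(Intercept)")               # drop the intercept term
})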