The R scripts and a reduced version of the file folder with the multiple csv file-formatted datasets in it can all be found in my GitHub Repository found in this Link.
In my script called 'LASSO code', after loading a file folder full of N csv file-formatted datasets into R and assigning them all to a list called 'datasets', I ran the following code to fit N LASSO Regressions, one to each of the datasets:
set.seed(11) # to ensure replicability
LASSO_fits <- lapply(dfs, function(i)
enet(x = as.matrix(select(i, starts_with("X"))),
y = i$Y, lambda = 0, normalize = FALSE))
Now, I would like to replicate this same process for a Backward Elimination Stepwise Regression we'll keep it simple by just using the step() function from the stats library) using another apply function rather than having to use a loop. The problem is this, the only was I know how to do this is by initializing or prepping it before running it by first establishing:
set.seed(100) # for reproducibility
full_fits <- vector("list", length = length(dfs))
Backward_Stepwise_fits <- vector("list", length = length(dfs))
And only then fitting all of the Backward_Stepwise_fits, but I cannot figure out how to put both full_fits and Backward_Stepwise_fits into the same apply function, the only way I can think of would be to use a for loop and stack them on top of each other inside of it, but that would be very computationally inefficient. And the number of datasets N I will be running both of these on is 260,000!
I wrote a for-loop that does in fact run, but it took over 12 hours to finish running on just 58,500 datasets which is unacceptably slow. The code I used for it is the following:
set.seed(100) # for reproducibility
for(i in seq_along(dfs)) {
full_fits[[i]] <- lm(formula = Y ~ ., data = dfs[[i]])
Backward_Stepwise_fits[[i]] <- step(object = full_fits[[i]],
scope = formula(full_fits[[i]]),
direction = 'backward', trace = 0) }
I have tried the following, but get the corresponding error message in the Console:
> full_model_fits <- lapply(dfs, function(i)
+ lm(formula = Y ~ ., data = dfs))
Error in terms.formula(formula, data = data) :
duplicated name 'X1' in data frame using '.'
Ever thought about parallelizing the whole thing?
First, you could define the code more succinctly.
system.time(
res <- lapply(lst, \(X) {
full <- lm(Y ~ ., X)
back <- step(full, scope=formula(full), dir='back', trace=FALSE)
})
)
# user system elapsed
# 3.895 0.008 3.897
system.time(
res1 <- lapply(lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE))
)
# user system elapsed
# 3.820 0.016 3.833
stopifnot(all.equal(res, res1))
The results are equal, but no time difference.
Now, using parallel::parLapply
.
library(parallel)
CL <- makeCluster(detectCores() - 1L)
system.time(
res2 <- parLapply(CL, lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE))
)
# user system elapsed
# 0.075 0.032 0.861
stopCluster(CL)
stopifnot(all.equal(res, res2))
On this machine about 4.5 times faster.
If we now want to run a forward step-wise selection using res2
as scope=
, we need parallel::clusterMap
, the multivariate variant of parLapply
:
# CL <- makeCluster(detectCores() - 1L)
res3 <- clusterMap(CL, \(X, Y) step(lm(Y ~ 1, X), scope=formula(Y), dir='forw', trace=FALSE),
lst, res2)
# stopCluster(CL)
NB: This yielded the same coefficients as using the for
loop you showed in comments:
stopifnot(all.equal(lapply(FS_fits, coef), unname(lapply(res3, coef))))
Your error duplicated name 'X1' in data frame using '.'
means, that in some of your datasets there are two columns named "X1"
. Here's how to find them:
names(lst$dat6)[9] <- 'X1' ## producing duplicated column X1 for demo
sapply(lst, \(x) anyDuplicated(names(x)))
# dat1 dat2 dat3 dat4 dat5 dat6 dat7 dat8 dat9 dat10 dat11
# 0 0 0 0 0 9 0 0 0 0 0
# ...
Result shows, in dataset dat6
the 9
th column is the (first) duplicate. All others are clean.
Data:
set.seed(42)
n <- 50
lst <- replicate(n, {dat <- data.frame(matrix(rnorm(500*30), 500, 30))
cbind(Y=rowSums(as.matrix(dat)%*%rnorm(ncol(dat))) + rnorm(nrow(dat)), dat)}, simplify=FALSE) |>
setNames(paste0('dat', seq_len(n)))