r, performance, for-loop, machine-learning, feature-selection

Replacing a for-loop that iteratively runs Backward Stepwise Regressions on a list of dfs in R with an apply function to reduce compute time


The R scripts, along with a reduced version of the file folder containing the multiple CSV-formatted datasets, can all be found in my GitHub repository at this link.

In my script called 'LASSO code', after loading a file folder full of N CSV-formatted datasets into R and assigning them all to a list called 'dfs', I ran the following code to fit N LASSO regressions, one to each of the datasets:

library(dplyr)         # for select() and starts_with()
library(elasticnet)    # for enet()

set.seed(11)     # to ensure replicability
LASSO_fits <- lapply(dfs, function(i) 
               enet(x = as.matrix(select(i, starts_with("X"))), 
                    y = i$Y, lambda = 0, normalize = FALSE))

Now, I would like to replicate this same process for a Backward Elimination Stepwise Regression (we'll keep it simple by just using the step() function from the stats library) using another apply function rather than a loop. The problem is that the only way I know how to do this is to prep it before running by first establishing:

set.seed(100)      # for reproducibility
full_fits <- vector("list", length = length(dfs))
Backward_Stepwise_fits <- vector("list", length = length(dfs))

And only then fitting all of the Backward_Stepwise_fits. But I cannot figure out how to fill both full_fits and Backward_Stepwise_fits within the same apply function; the only way I can think of would be to use a for-loop and assign into both lists inside of it, but that would be very computationally inefficient. And the number of datasets N I will be running both of these on is 260,000!
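For what it's worth, a single lapply can produce both sets of fits at once if each iteration returns a named two-element list, which you then pull apart afterwards; no pre-allocation is needed. A minimal sketch, assuming each data frame in dfs has a response column Y (the toy dfs below is a stand-in for the real list):

```r
## Toy stand-in for the real list of data frames (the real one has N = 260,000)
set.seed(1)
dfs <- replicate(3, {
  d <- setNames(data.frame(matrix(rnorm(100 * 4), 100, 4)), paste0("X", 1:4))
  d$Y <- d$X1 - d$X2 + rnorm(100)
  d
}, simplify = FALSE)

set.seed(100)      # for reproducibility
both_fits <- lapply(dfs, function(d) {
  full <- lm(Y ~ ., data = d)
  back <- step(full, scope = formula(full),
               direction = "backward", trace = 0)
  list(full = full, back = back)   # return both models for this dataset
})

## split the paired results back into the two lists you wanted
full_fits              <- lapply(both_fits, `[[`, "full")
Backward_Stepwise_fits <- lapply(both_fits, `[[`, "back")
```

Returning a pair per iteration keeps the full model and its stepwise reduction together, so there is no risk of the two lists getting out of sync.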

I wrote a for-loop that does in fact run, but it took over 12 hours to finish on just 58,500 datasets, which is unacceptably slow. The code I used is the following:

set.seed(100)      # for reproducibility
for(i in seq_along(dfs)) {
  full_fits[[i]] <- lm(formula = Y ~ ., data = dfs[[i]])
  Backward_Stepwise_fits[[i]] <- step(object = full_fits[[i]], 
                        scope = formula(full_fits[[i]]),
                        direction = 'backward', trace = 0) }

I have tried the following, but get the corresponding error message in the Console:

> full_model_fits <- lapply(dfs, function(i)
+   lm(formula = Y ~ ., data = dfs))
Error in terms.formula(formula, data = data) : 
duplicated name 'X1' in data frame using '.'

Solution

  • Ever thought about parallelizing the whole thing?

    First, you could define the code more succinctly.

    system.time(
      res <- lapply(lst, \(X) {
        full <- lm(Y ~ ., X)
        back <- step(full, scope=formula(full), dir='back', trace=FALSE)
      })
    )
    #  user  system elapsed 
    # 3.895   0.008   3.897 
    
    system.time(
      res1 <- lapply(lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE))
    )
    #  user  system elapsed 
    # 3.820   0.016   3.833 
    
    stopifnot(all.equal(res, res1))
    

    The results are equal, but there is no time difference.

    Now, using parallel::parLapply.

    library(parallel)
    
    CL <- makeCluster(detectCores() - 1L)
    
    system.time(
      res2 <- parLapply(CL, lst, \(X) step(lm(Y ~ ., X), dir='back', trace=FALSE))
    )
    #  user  system elapsed 
    # 0.075   0.032   0.861 
    
    stopCluster(CL)
    
    stopifnot(all.equal(res, res2))
    

    On this machine, that's about 4.5 times faster.
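    As an aside (an alternative, not part of the timing above): on Unix-alikes you can get a similar speedup with forked workers via parallel::mclapply, which skips the cluster setup and teardown and makes the data visible to workers without clusterExport. A sketch with a small toy list standing in for the real data:

```r
library(parallel)

## Toy stand-in for the real list of data frames (assumption for this sketch)
set.seed(42)
lst <- replicate(10, {
  dat <- data.frame(matrix(rnorm(200 * 5), 200, 5))
  cbind(Y = rowSums(dat) + rnorm(200), dat)
}, simplify = FALSE)

## mclapply forks the current session, so lst is visible in the workers
## without clusterExport(); on Windows only mc.cores = 1 is supported
cores <- if (.Platform$OS.type == "windows") 1L else max(1L, detectCores() - 1L)
res_mc <- mclapply(lst,
                   \(X) step(lm(Y ~ ., X), direction = "backward", trace = FALSE),
                   mc.cores = cores)
```

    Forking shares memory copy-on-write, which matters when the list of data frames is large; PSOCK clusters (parLapply) instead serialize the data to each worker.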

    If we now want to run a forward step-wise selection using res2 as scope=, we need parallel::clusterMap, the multivariate variant of parLapply:

    # CL <- makeCluster(detectCores() - 1L)
    
    res3 <- clusterMap(CL, \(X, Y) step(lm(Y ~ 1, X), scope=formula(Y), dir='forw', trace=FALSE),
                       lst, res2)
    
    # stopCluster(CL)
    

    NB: This yielded the same coefficients as using the for loop you showed in comments:

    stopifnot(all.equal(lapply(FS_fits, coef), unname(lapply(res3, coef))))
    

    Your error duplicated name 'X1' in data frame using '.' means that in at least one of your datasets there are two columns named "X1". Here's how to find them:

    names(lst$dat6)[9] <- 'X1'  ## producing duplicated column X1 for demo 
    
    sapply(lst, \(x) anyDuplicated(names(x)))
    # dat1  dat2  dat3  dat4  dat5  dat6  dat7  dat8  dat9 dat10 dat11 
    # 0     0     0     0     0     9     0     0     0     0     0 
    # ...
    

    The result shows that in dataset dat6 the 9th column is the (first) duplicate; all the others are clean.
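    If you want to repair the duplicates rather than just locate them, one option (my suggestion, not part of the answer above) is base R's make.unique(), which keeps the first occurrence and suffixes the repeats:

```r
## demo: a data frame with a duplicated column name, as in the error
set.seed(1)
d <- data.frame(Y = rnorm(5), X1 = rnorm(5), X2 = rnorm(5))
names(d)[3] <- "X1"                    # force the duplicate

names(d) <- make.unique(names(d))      # first "X1" kept, repeat becomes "X1.1"

## applied to the whole list before fitting:
## lst <- lapply(lst, \(x) setNames(x, make.unique(names(x))))
```

    Whether renaming is appropriate depends on what the duplicated column actually contains; if it is a stray copy, dropping it instead may be the right fix.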


    Data:

    set.seed(42)
    n <- 50
    lst <- replicate(n, {
      dat <- data.frame(matrix(rnorm(500 * 30), 500, 30))
      cbind(Y = rowSums(as.matrix(dat) %*% rnorm(ncol(dat))) + rnorm(nrow(dat)), dat)
    }, simplify = FALSE) |> 
      setNames(paste0('dat', seq_len(n)))