Tags: r, machine-learning, benchmarking, montecarlo, reproducible-research

How to run a single regression function in R on each of a large number of datasets within a folder


I need to run the enet() function from the elasticnet library in RStudio on each of these 47,000 datasets individually. The datasets were constructed so that the true underlying population model for each one is known, and I want to see how often the new algorithm recovers it compared with LASSO and Stepwise, as well as the runtime of each method.

I have absolutely no idea how to do this, or even what search terms to use to look it up; I have already tried both Google and Bing several times. I believe the only package my code as it stands requires is elasticnet.

This is my code to run the LASSO (obviously, I made up the dataframe names for the x & y arguments in the enet() function for this post/question lol):

## Attempt 2: Run a LASSO regression using
## the enet function from the elasticnet library
set.seed(11)
library(elasticnet)
enet_LASSO <- enet(x = as.matrix(df_all_obs_on_all_of_the_IVs),
                   y = df_all_obs_on_the_DV,
                   lambda = 0, normalize = FALSE)
print(enet_LASSO)
# In order to ascertain which predictors/regressors are still
# included in the model after running a LASSO regression on it for
# the purpose of variable selection, I am going to use the predict()
# generic, which dispatches to predict.enet() from elasticnet.
LASSO_coeffs <- predict(enet_LASSO,
                        newx = as.matrix(df_all_obs_on_all_of_the_IVs),
                        s = 0.1, mode = "fraction", type = "coefficients")
print(LASSO_coeffs)
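
For completeness, the predictors that survive the LASSO can then be read off the $coefficients component of that returned object (this is a sketch based on the assumption that predict.enet() follows the lars-style convention, where $coefficients is a named numeric vector when a single value of s is supplied):

# Sketch: names of the predictors with non-zero LASSO coefficients
# at s = 0.1, i.e. the variables the LASSO kept (assumes
# LASSO_coeffs$coefficients is a named numeric vector).
surviving_predictors <- names(which(LASSO_coeffs$coefficients != 0))
print(surviving_predictors)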

Optional background context & motivation: I am in the middle of a research project. To compare a new statistical learning procedure for choosing the optimal regression specification, I am running it as a Monte Carlo experiment in which the new algorithm and two benchmarks (LASSO & Stepwise) are each run on synthetic data my collaborator created for me: a multi-GB folder containing 47,000 individual csv files.


Solution

  • Listing all the files, applying a read function to each, and then applying your enet call to each of the resulting data frames should do the trick, provided you have enough RAM to hold them all. Here is what the code would look like:

    library(readr)   # for read_csv()
    library(dplyr)   # for %>% and mutate()

    # List every csv in the folder (and subfolders) and read each one in,
    # keeping the file paths as names so each result stays identifiable.
    file_list <- list.files(directory_path, pattern = "\\.csv$",
                            full.names = TRUE, recursive = TRUE)
    csvs <- lapply(file_list, read_csv)
    names(csvs) <- file_list
    tib <- tibble::enframe(csvs) %>%
      mutate(enet_column = lapply(value, function(y) {your_function_contents_relative_to_y_here})) %>%
      tidyr::unnest()   # This step is optional

    You may wish to just lapply your custom function to the csvs list instead and form the tibble at the end: one lapply to form a list of enet objects, another to store the LASSO coefficients. A rough sketch of that approach is below. Let me know if you have any questions.
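
    For concreteness, here is a minimal sketch of that two-lapply approach, reusing the csvs list built above. It assumes each csv stores the response in a column named y and the predictors in every other column, and it wraps your enet() call in a hypothetical helper called fit_enet(); both the column layout and the helper name are assumptions you would adapt to how your collaborator actually structured the files.

    library(elasticnet)

    # Hypothetical helper: fit one enet/LASSO model on a single dataset,
    # assuming the DV lives in a column named "y" and the IVs are the
    # remaining columns (adapt this to your files' actual layout).
    fit_enet <- function(df) {
      enet(x = as.matrix(df[, setdiff(names(df), "y")]),
           y = df$y, lambda = 0, normalize = FALSE)
    }

    # One lapply to form the list of enet objects (one per csv),
    # another to store the LASSO coefficients at s = 0.1 from each fit.
    enet_fits <- lapply(csvs, fit_enet)
    lasso_coefs <- lapply(enet_fits, function(fit) {
      predict(fit, s = 0.1, mode = "fraction", type = "coefficients")$coefficients
    })

    If you also need the runtime for each dataset, wrapping the fit_enet(df) call in system.time() and keeping those results in a parallel list is one straightforward way to collect the timings.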