r, performance, regression, runtime, coding-efficiency

Assessing & improving the runtime efficiency for loading a large number of CSV datasets into R and assigning all of them to a list object


All of the code included in this question can be found in the 'LASSO code(Version for Antony)' script in the GitHub Repository for this project.

This is all part of a research project with a collaborator in which we are exploring the properties and measuring the performance of a new statistical learning algorithm for optimal variable selection. Its performance is measured and compared against three optimal variable selection benchmarks (LASSO, Backward Stepwise, & Forward Stepwise Regression) after all 4 have been run on the same set of 260,000 synthetic datasets. All of these datasets have the same number of synthetic observations on the same number of columns, and were generated via Monte Carlo simulation in such a way that the distribution each observation comes from and the 'true' underlying statistical properties characterizing each dataset are known by construction.

So, all that must be done is to run all 4 algorithms on this massive file folder of 260k csv files, which on my system has been named 'datasets folder'. After loading all of the necessary libraries, this is all of my code before the command to load/import the datasets:

# these 2 lines together create a character vector of the full
# paths of all the files in the folder of datasets you created
folderpath <- "C:/Users/Spencer/Documents/EER Project/datasets folder"
filepaths_list <- list.files(path = folderpath, full.names = TRUE, 
                             recursive = TRUE)
# strip the folder paths and the .csv extensions to get just the dataset names
DS_names_list <- basename(filepaths_list)
DS_names_list <- tools::file_path_sans_ext(DS_names_list)

# sort both lists of file names so that they are in the proper order
my_order = DS_names_list |> 
  # split apart the numbers, convert them to numeric 
  strsplit(split = "-", fixed = TRUE) |>  unlist() |> as.numeric() |>
  # get them in a data frame
  matrix(nrow = length(DS_names_list), byrow = TRUE) |> as.data.frame() |>
  # get the appropriate ordering to sort the data frame
  do.call(order, args = _)
DS_names_list = DS_names_list[my_order]
filepaths_list = filepaths_list[my_order]

And this is the code I am using to load/import the 260k datasets into my Environment so I can run my LASSO Regressions on each of them and tally up how well they do:

# this line reads all of the data in each of the csv files 
# using the file paths stored in the list we just created
CL <- makeCluster(detectCores() - 1L)
clusterExport(CL, c('filepaths_list'))
system.time( datasets <- lapply(filepaths_list, read.csv) )
stopCluster(CL)

... here is the problem though: I hit Run on all of the included code above besides stopCluster(CL), and despite starting system.time( datasets <- lapply(filepaths_list, read.csv) ) over 54 hours ago now, it still has not finished loading my datasets into RStudio's Workspace!! I have a 2022 HP laptop of medium quality which I upgraded from the 12 GB of RAM it came with to 32 GB. Back when I did this same operation several months ago with 58,500 datasets instead of 260,000, it usually took only about 2 hours for datasets <- lapply(filepaths_list, read.csv) to run, so I really feel like something must be wrong here.

p.s. I know there is nothing wrong syntactically per se, because I have run the same script in other RStudio windows on folders with only 10 & 40 datasets in them, just to make sure that is not the problem. One more thing: I also tried doing this a few days ago without the parallel part or the system.time() part, but I accidentally left my laptop unplugged for about 90 minutes and it was working so hard that it went dead in that time.


Solution

  • Is there a faster way than fread() to read big data?

    Check this out for speed: data.table::fread(), and reading in only the columns you need via its select argument, may improve your read times considerably. The microbenchmark package can also help you run timing tests to determine the fastest method on your own files; see the sketch below.
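
    A minimal sketch of what this could look like, assuming the data.table and microbenchmark packages are installed and reusing the filepaths_list object from the question. The sample size of 20 files and the column names passed to select ("Y", "X1", "X2") are placeholders, not part of the original code:

    library(data.table)       # for fread()
    library(microbenchmark)   # for timing comparisons

    # benchmark on a small sample of files first rather than all 260k
    sample_paths <- filepaths_list[1:20]

    microbenchmark(
      base_read  = lapply(sample_paths, read.csv),
      fread_all  = lapply(sample_paths, fread),
      # read only the columns needed for the regressions;
      # "Y", "X1", "X2" are placeholder column names
      fread_some = lapply(sample_paths, fread, select = c("Y", "X1", "X2")),
      times = 5
    )

    # once the fastest option is clear, apply it to the full list of paths
    datasets <- lapply(filepaths_list, fread)

    Note that fread() returns a data.table rather than a plain data.frame; if downstream code expects data frames, pass data.table = FALSE to fread() or wrap each result in as.data.frame().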