I have a large dataframe (54k rows, 38k columns). My goal is to run a simple function, test_func(), on this dataframe. test_func()
requires the use of 8 index columns that contain generic information such as dates, times, and numeric values. The remaining 38k columns hold integer data (0/1), and I want to run test_func() on each of these columns in turn and store the results.
I had originally assumed the best approach was to split the large dataframe into a list of dataframes, where each element contains the index columns plus the single variable column to be tested on that iteration. I thought this would avoid having to manipulate the large dataframe on every loop iteration.
The splitting of the large dataframe was achieved via lapply(), using the following code, which came from help received in this post (splitting a dataframe into list of data frames by column while keeping index columns and a single variable column).
set.seed(123)
df <- data.frame(
  a = rnorm(10, 5, 1), b = rnorm(10, 5, 1), c = rnorm(10, 5, 1),
  z = rnorm(10, 5, 1), y = rnorm(10, 5, 1), x = rnorm(10, 5, 1)
)

# one list element per non-index column: the index columns, that column,
# and a "type" column recording which variable it holds
split_df_by_col <- function(df, index_cols) {
  cols_to_split <- setdiff(names(df), index_cols)
  lapply(cols_to_split, \(col) cbind(df[, c(index_cols, col)], type = col))
}

df_lst <- split_df_by_col(df, index_cols = c("a", "b", "c"))
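For reference, the memory figures quoted below can be checked with something like base R's object.size():

# compare the footprint of the original dataframe and the list of dataframes
format(object.size(df), units = "Gb")
format(object.size(df_lst), units = "Gb")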
However, when I apply this to my dataframe, which takes up 9.303434 GB of memory, the resulting list of dataframes, df_lst, takes up 223.198 GB. I then pass df_lst to a foreach() loop that iterates over each dataframe in the list using parallel processing, as follows:
library(parallel)
library(doParallel)
library(foreach)

# set up cluster
ncores <- detectCores(logical = FALSE) - 1
myCluster <- makeCluster(ncores, type = "FORK", outfile = "")
registerDoParallel(myCluster)

run <- foreach(i = seq_along(df_lst), .errorhandling = "pass") %dopar% {
  flt_dat <- df_lst[[i]]
  res <- test_func(flt_dat)
  return(res)
}

# shut down cluster
stopCluster(myCluster)
Unfortunately this causes R to crash and the run to fail. I know the code itself works, as I've tested it on smaller dataframes with 5k columns and it runs fine.
What is the best approach to this problem?
Avoid unnecessary copies of the data. The following should illustrate one way to approach this:
library(foreach)

index_cols <- c("a", "b", "c")  # the index columns, as in the example above

ix  <- df[, index_cols]
dat <- df[, !colnames(df) %in% index_cols]

foreach(x = dat, n = names(dat)) %do% {
  cbind(ix, setNames(list(x), n))
}
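If the parallel run is still needed, here is a minimal sketch of combining this pattern with the doParallel/FORK backend from the question, building each small dataframe inside the worker instead of materialising the full list up front (test_func() is your function; its signature is assumed here):

library(doParallel)

ncores    <- parallel::detectCores(logical = FALSE) - 1
myCluster <- parallel::makeCluster(ncores, type = "FORK")
registerDoParallel(myCluster)

# each worker assembles only the small dataframe it needs
run <- foreach(x = dat, n = names(dat), .errorhandling = "pass") %dopar% {
  flt_dat <- cbind(ix, setNames(list(x), n))
  test_func(flt_dat)   # assumed to take the small dataframe as input
}

parallel::stopCluster(myCluster)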
You don't share the function, but it might be better to focus on optimizing that function's performance, redesigning the whole approach, and getting rid of the need for a foreach loop entirely. If you go that route, you should probably melt your data.frame as a first step.
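A minimal sketch of that melt step, assuming data.table and the column names from the toy example above (how test_func() would consume the long format is up to you, since the function isn't shown):

library(data.table)

dt <- as.data.table(df)

# one row per (index columns, variable) combination: "type" records which
# original column the "value" came from
long <- melt(
  dt,
  id.vars       = c("a", "b", "c"),
  variable.name = "type",
  value.name    = "value"
)

# test_func() could then be applied once per group, e.g.
# long[, test_func(.SD), by = type]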