Tags: r, foreach, parallel-processing, doparallel

Append CSV file inside foreach loop


I have a parallelized foreach loop in R that generates a list of dataframes. I want to save and append to a .csv file in each iteration, in order to get a final .csv file containing the complete list. I know I can take the classical approach and rbind the output generated by foreach to get the dataframe. The problem is that I might run into memory issues since I'm using a big dataframe. I provide some example code:

foreach (l=1:length(list), .packages = c("bio3d", 
                                         "rJava", 
                                         "rcdk", 
                                         "ChemmineR")) %dopar% {
  # some code here that generates a df named rmsd from a list

  rmsd <- do.call(rbind, rmsd)
}

I need a fast and safe way to save rmsd in each iteration and append it to a csv named "results-rmsd.csv" without row.names. Thank you very much in advance for your help!

I already tried the solutions from previous questions on the same issue (see here), but they didn't work for me.


Solution

  • (Sorry, I wrote this before realizing you already know this, so I've added another suggestion at the end.)

    You always want to work with the return value of foreach(). A good rule of thumb is that if your code calls foreach() without assigning its value to a variable, you're probably attempting something you shouldn't.

    So, use:

    res <- foreach(l=1:length(list), .packages = c("bio3d", 
                                             "rJava", 
                                             "rcdk", 
                                             "ChemmineR")) %dopar% {
      # some code here that generates a df named rmsd from a list
    
      rmsd
    }
    res <- do.call(rbind, res)
    

    Instead of that last line, you can use res <- foreach(..., .combine = rbind) ... as an alternative; see the sketch below.
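
    For reference, a minimal sketch of that alternative (same placeholder body as above), letting foreach() do the row-binding:

    res <- foreach(l = 1:length(list), .combine = rbind,
                   .packages = c("bio3d",
                                 "rJava",
                                 "rcdk",
                                 "ChemmineR")) %dopar% {
      # some code here that generates a df named rmsd from a list

      rmsd  # each task's data frame becomes one set of rows in 'res'
    }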

    I know I can take the classical approach and rbind the output generated by foreach to get the dataframe. The problem is that I might run into memory issues since I'm using a big dataframe.

    If this is the case, you probably have to save the results to file and read them back after foreach() completes. Something like:

    files <- foreach(l=1:length(list), .packages = c("bio3d", 
                                             "rJava", 
                                             "rcdk", 
                                             "ChemmineR")) %dopar% {
      # some code here that generates a df named rmsd from a list
    
      file <- tempfile(fileext = ".rds")
      saveRDS(rmsd, file = file)
      file
    }
    
    res <- lapply(files, FUN = readRDS)
    res <- do.call(rbind, res)
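
    If memory allows that final rbind, the combined result can then be written out as requested, and the temporary files cleaned up, e.g.:

    write.csv(res, file = "results-rmsd.csv", row.names = FALSE)
    file.remove(files)  # optional: delete the temporary .rds files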
    

    If this is still not sufficient because of your memory limitations, then you need to figure out a way to work with only a subset of the data at any time.

    I need a fast and safe way to save rmsd in each iteration and append it to a csv named "results-rmsd.csv" without row.names.

    You cannot have multiple tasks appending to the same file when running in parallel: they will write on top of each other and interleave the output. You need to write to separate files, which you then merge sequentially, as sketched below.
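
    A minimal sketch of that pattern, assuming each task writes its own temporary CSV (the file handling and the write.table() flags here are illustrative):

    files <- foreach(l = 1:length(list), .packages = c("bio3d",
                                                       "rJava",
                                                       "rcdk",
                                                       "ChemmineR")) %dopar% {
      # some code here that generates a df named rmsd from a list

      file <- tempfile(fileext = ".csv")
      write.csv(rmsd, file = file, row.names = FALSE)  # one private file per task
      file
    }

    # Merge sequentially: only one chunk is held in memory at a time.
    for (i in seq_along(files)) {
      chunk <- read.csv(files[[i]])
      write.table(chunk, file = "results-rmsd.csv", sep = ",",
                  row.names = FALSE,
                  col.names = (i == 1),  # write the header only once
                  append = (i > 1))
    }

    This keeps peak memory at roughly one chunk, at the cost of re-parsing each temporary file once.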