Tags: r, multithreading, simulation, supercomputers

How do I save output from a large simulation in R? (multiple nodes, safe access)


I am doing a large simulation for a research project: simulating 1,000 football seasons and analyzing the results. As the seasons will be spread across multiple nodes, I need an easy way to save my output data into a file (or files) to access later. Since I can't control when the nodes will finish, I can't have them all trying to write to the same file at the same time, but if they each save to a different file, I would need a way to aggregate all the data easily afterward. Thoughts?


Solution

  • I don't know whether this question has been asked already, but here is what I do in my research. You can loop through the file names and aggregate them into one object like so:

    require(data.table)
    dt1 <- data.table()
    for (i in 1:100) {
      # each chunk's results sit in their own folder, under the same file name
      k <- paste0("C:/chunkruns/dat", i, "/dt.RData")
      load(k)                # loads an object called dt from that chunk's file
      dt1 <- rbind(dt1, dt)
    }
    
    agg.data <- dt1
    rm(dt1)
    

    The above code assumes that all your files are saved in different folders but with the same file name.
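
    If the number of chunks is large, a slightly different version of the same loop collects the chunks in a list and binds them once with rbindlist(), which avoids repeatedly copying the growing table. A minimal sketch, assuming the same folder layout and object name (dt) as above:

    require(data.table)
    # sketch: same per-chunk files as above, but bind everything in one step at the end
    chunks <- vector("list", 100)
    for (i in 1:100) {
      load(paste0("C:/chunkruns/dat", i, "/dt.RData"))  # loads the object dt
      chunks[[i]] <- dt
    }
    agg.data <- rbindlist(chunks)
    rm(chunks)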

    Alternatively, you can use the following to identify file paths matching a pattern and then combine them:

    require(data.table)
    # Get the list of files whose names start with "Output", searching sub-folders too.
    # Note that pattern is a regular expression, so use "^Output" rather than a glob such as "Output*".
    k <- list.files(path = "W:/chunkruns/dat", pattern = "^Output", all.files = FALSE, full.names = TRUE, recursive = TRUE)
    # Read each file (skipping its 11-line header block) and bind the pieces into one data.table
    m <- lapply(k, FUN = function(x) read.csv(x, skip = 11, header = TRUE))
    agg.data <- rbindlist(m)
    rm(m)
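
    To address the concern about simultaneous writes: as long as each node writes to its own uniquely named file, the nodes never touch a shared file, and the pattern match above collects everything afterward. A minimal sketch of that node-side step, where the file-name scheme and the SLURM_ARRAY_TASK_ID environment variable are assumptions to adapt to your own scheduler:

    require(data.table)
    # Illustrative per-node write: the file name is keyed on a scheduler-supplied index
    # (falling back to the process ID), so no two nodes ever open the same file.
    task.id <- Sys.getenv("SLURM_ARRAY_TASK_ID", unset = as.character(Sys.getpid()))
    results <- data.table(season = 1:10, wins = sample(0:16, 10, replace = TRUE))  # placeholder data
    fwrite(results, file.path("W:/chunkruns/dat", paste0("Output_", task.id, ".csv")))

    (Files written this way have no extra header block, so the read step above would use skip = 0 rather than skip = 11; the skip value just needs to match whatever your real output files contain.)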