The way I need to run my scripts is to first run 4 R scripts in parallel using the rstudioapi::jobRunScript() function. Each of the scripts running in parallel does not import anything from any environment, but exports the data frames it creates to the global environment. My 5th R script builds on the data frames created by the 4 parallel scripts, and this 5th script runs in the console. If there's a way to run the 5th script in the background rather than in the console once the first 4 scripts are done running in parallel, that would be a lot better. I'm also trying to reduce the total running time of the whole process.
Although I was able to figure out how to run the first 4 R scripts in parallel, my task isn't completely done because I can't find a way to trigger the run of my 5th R script. Hope y'all can help me here.
This is a bit too open-ended for my liking. While rstudioapi can certainly be used for running parallel tasks, it is not very versatile and does not give you very useful output. The parallel universe is well implemented in R, with several packages that provide a much simpler and better interface for doing this. Here are 3 options, which also allow something to be returned from the different files.
With the parallel package we can achieve this very simply: create a vector of the files to be sourced and execute source in each worker. The main process will block while they are running, but if you have to wait for them to finish anyway, this doesn't really matter much.
library(parallel)
ncpu <- detectCores()
cl <- makeCluster(ncpu)
# full paths to the files that should be executed
files <- c(...)
# use an lapply in parallel.
result <- parLapply(cl, files, source)
# Remember to close the cluster
stopCluster(cl)
# If anything is returned this can now be used.
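Keep in mind that the scripts are sourced inside the workers, so anything they assign to their own global environment stays there; whatever the 5th script needs has to come back through the return value of source (which contains the last evaluated expression under $value). As a rough sketch of how the 5th script could then be triggered, assuming each of the 4 scripts ends with the data frame it creates, and with the names below being placeholders for whatever the 5th script actually expects:

names(result) <- c("df_a", "df_b", "df_c", "df_d")   # hypothetical names
# each element of result is what source() returned: a list with $value and $visible
list2env(lapply(result, `[[`, "value"), envir = globalenv())
# the cluster work is finished at this point, so the 5th script can simply be sourced
source("script5.R")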
As a side note, several packages have an interface similar to the parallel package, which was built upon the snow package, so it is a good baseline to know.
An alternative to the parallel package is the foreach package, which gives something similar to a for-loop interface, simplifying things while giving more flexibility and automatically importing necessary libraries and variables (although it is safer to do this manually). The foreach package does, however, depend on the parallel and doParallel packages to set up a cluster.
library(parallel)
library(doParallel)
library(foreach)
ncpu <- detectCores()
cl <- makeCluster(ncpu)
files <- c(...)
registerDoParallel(cl)
# Run parallel using foreach
# remember %dopar% for parallel. %do% for sequential.
result <- foreach(file = files, .combine = list, .multicombine = TRUE) %dopar% {
source(file)
# Add any code before or after source.
}
# Stop cluster
stopCluster(cl)
# Do more stuff. Result holds any result returned by foreach.
While it does add a few lines of code, the .combine, .packages and .export arguments make for a very simple interface to work with parallel computing in R.
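As a rough illustration of those arguments (dplyr, the lookup object and the file names below are made up for the example), loading a package on every worker and exporting a variable could look like this:

library(parallel)
library(doParallel)
library(foreach)

cl <- makeCluster(2)
registerDoParallel(cl)

lookup <- data.frame(id = 1:3, label = c("a", "b", "c"))   # hypothetical helper object
files <- c("script1.R", "script2.R")                       # placeholder file names

result <- foreach(file = files,
                  .combine = list, .multicombine = TRUE,
                  .packages = "dplyr",    # attached on every worker
                  .export = "lookup") %dopar% {
  # local = TRUE keeps the exported objects visible to the sourced code
  source(file, local = TRUE)
}

stopCluster(cl)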
Now this is one of the more rarely used packages. future provides a parallel interface that is more flexible than both parallel and foreach, allowing for asynchronous parallel programming. The implementation can, however, seem a bit more daunting, and the example I provide below only scratches the surface of what is possible.
Also worth mentioning is that while the future package does provide automatic import of the functions and packages necessary to run the code, experience has made me aware that this is limited to the first level of depth in any call (sometimes less), so explicit exporting is still necessary at times.
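For example, globals and packages can be handed to future() explicitly; in this sketch dat is a made-up object and stats is only listed to show the argument:

library(future)
plan(multisession)

dat <- data.frame(x = 1:10)    # hypothetical object the future needs

f <- future({
  stats::median(dat$x)         # evaluated in a background R session
}, globals = list(dat = dat), packages = "stats")

value(f)                       # blocks until the future has finished
plan(sequential)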
While foreach depends on parallel (or a similar package) to start a cluster, future will start one itself using all the available cores. A simple call to plan(multiprocess) will start a multi-core session (note that newer versions of the future package deprecate multiprocess in favour of plan(multisession)).
library(future)
files <- c(...)
# Start multiprocess session
plan(multiprocess)
# Simple wrapper function, so we can iterate over the files variable more easily
source_future <- function(file)
  future(source(file))
results <- lapply(files, source_future)
# Do some calculations in the meantime
print('hello world, I am running while waiting for the futures to finish')
# Force waiting for the futures to finish
resolve(results)
# Extract any result from the futures
results <- values(results)
# Clean up the process (close down clusters)
plan(sequential)
# Run some more code.
Now this might seem quite heavy at first, but the general mechanism is:

1. Start a multi-core session with plan(multiprocess)
2. Create your futures with future (or %<-%, which I won't go into)
3. Wait for them with resolve, which works on a single future or multiple futures in a list (or environment)
4. Extract the results with value for single futures or values for multiple futures in a list (or environment)
5. Close down the future environment by using plan(sequential)
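Tying this back to the original question, a minimal sketch using the multisession plan could look like the following (script1.R to script5.R are placeholder names, and it is assumed each of the first 4 scripts ends with the data frame it creates, so the results come back through the futures):

library(future)
plan(multisession)

files <- c("script1.R", "script2.R", "script3.R", "script4.R")

# launch the four scripts asynchronously
futs <- lapply(files, function(file) future(source(file)))

# block until all four scripts have finished
resolve(futs)

# source() returns its last value under $value; collect the four results
dfs <- lapply(futs, function(f) value(f)$value)

# the four scripts are done, so the fifth one can now run in this session
source("script5.R")

plan(sequential)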
I believe these 3 packages provide interfaces to every element of multiprocessing (at least on the CPU) that a user needs. Other packages provide alternative interfaces, while for asynchronous programming I am only aware of future and promises. In general I'd advise most users to be very careful when moving into asynchronous programming, as it can cause a whole suite of problems that are less frequent in synchronous parallel programming.
I hope this helps provide an alternative to the (very limiting) rstudioapi interface, which I am fairly certain was never meant to be used for parallel programming by users themselves, but was more likely intended for tasks such as the IDE building a package in parallel.