I'm testing out the targets package and am running into a problem with customizing parallelization. My workflow has two steps, and I'd like to parallelize the first step over 4 workers and the second step over 16 workers. I want to know if I can solve the problem by calling tar_make_future() and then specifying how many workers each step requires in the tar_target() calls. I've got a simple example below, where I'd like the data step to execute with 1 worker and the sums step to execute with 3 workers.
library(targets)
tar_dir({
  tar_script({
    library(future)
    library(future.callr)
    library(dplyr)
    plan(callr)
    list(
      # Goal: this step should execute with 1 worker
      tar_target(
        data,
        data.frame(
          x = seq_len(6),
          id = rep(letters[seq_len(3)], each = 2)
        ) %>%
          group_by(id) %>%
          tar_group(),
        iteration = "group"
      ),
      # Goal: this step should execute with 3 workers, in parallel
      tar_target(
        sums,
        sum(data$x),
        pattern = map(data),
        iteration = "vector"
      )
    )
  })
  tar_make_future()
})
I know that one option is to configure the parallel backend separately within each step and then call tar_make() to execute the workflow serially, as in the sketch below. I'm curious about whether I can get this kind of result with tar_make_future().
I would recommend that you call tar_make_future(workers = <max_parallel_workers>) and let targets decide how many workers to run at any given time. targets automatically figures out which targets can run in parallel and which need to wait for upstream dependencies to finish. In your case, some of the data branches may finish before others, in which case the corresponding sums branches can start right away. In other words, some sums branches will start running before other sums branches can start, and you can trust targets to scale up transient workers when the need arises. The animation at https://books.ropensci.org/targets/hpc.html#future may help visualize this. If you were to micromanage the parallelism for data and sums separately, you would likely have to wait for all of data to finish before any of sums could start, which could take a long time.
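Concretely, with your example pipeline left exactly as written, that just means passing a worker count to tar_make_future(). The value 3 below is only an illustration, not a value from your post; pick whatever ceiling fits your machine:

library(targets)
# Keep the target script from the question unchanged.
# workers sets the maximum number of transient callr workers;
# targets decides which targets and branches occupy them at any moment.
tar_make_future(workers = 3)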