Tags: r, r-future, targets-r-package

Can I set different degrees of parallelism for different targets with the R {targets} package?


I'm testing out the targets package and am running into a problem with customizing parallelization. My workflow has two steps, and I'd like to parallelize the first step over 4 workers and the second step over 16 workers.

I want to know if I can solve the problem by calling tar_make_future(), and then specifying how many workers each step requires in the tar_target calls. I've got a simple example below, where I'd like the data step to execute with 1 worker, and the sums step to execute with 3 workers.

library(targets)

tar_dir({
  tar_script({
    library(future)
    library(future.callr)
    library(dplyr)

    plan(callr)

    list(
      # Goal: this step should execute with 1 worker
      tar_target(
        data,
        data.frame(
          x = seq_len(6),
          id = rep(letters[seq_len(3)], each = 2)
        ) %>%
          group_by(id) %>%
          tar_group(),
        iteration = "group"
      ),
      # Goal: this step should execute with 3 workers, in parallel
      tar_target(
        sums,
        sum(data$x),
        pattern = map(data),
        iteration = "vector"
      )
    )
  })
  tar_make_future()
})

I know that one option is to configure the parallel backend separately within each step and then call tar_make() to run the targets themselves serially, roughly like the untested sketch below. What I'm curious about is whether I can get the same kind of per-step control with tar_make_future().
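For reference, this is roughly what I mean by that alternative. It's an untested sketch, and my choice of future.apply with a multisession plan is arbitrary; any backend that parallelizes inside a single target's command would work the same way.

library(targets)

tar_dir({
  tar_script({
    library(future)
    library(future.apply)

    # Parallelism lives inside the command of `sums`, not across targets.
    plan(multisession, workers = 3)

    list(
      tar_target(
        data,
        data.frame(
          x = seq_len(6),
          id = rep(letters[seq_len(3)], each = 2)
        )
      ),
      tar_target(
        sums,
        # Group sums computed in parallel within this one target.
        future_sapply(split(data$x, data$id), sum)
      )
    )
  })
  tar_make()  # the targets themselves still run one at a time
})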


Solution

  • I recommend calling tar_make_future(workers = <max_parallel_workers>) and letting targets decide how many workers to run in parallel. targets automatically figures out which targets can run concurrently and which need to wait for upstream dependencies to finish. In your case, some of the data branches may finish before others, in which case the corresponding sums branches can start right away. In other words, some sums branches will begin running before others, and you can trust targets to scale up transient workers as the need arises. The animation at https://books.ropensci.org/targets/hpc.html#future may help visualize this. If you were to micromanage the parallelism of data and sums separately, you would likely have to wait for all of data to finish before any of sums could start, which could take a long time. A minimal example call is sketched below.
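For concreteness, here is one way to run the example pipeline above with a pool of transient workers. The number 3 simply matches the three sums branches in the toy example; substitute whatever ceiling fits your real workflow and machine.

# Run the pipeline with up to 3 transient future workers.
# targets launches workers only when branches are ready to run.
tar_make_future(workers = 3)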