r drake-r-package

Best practice for multiple subplans in drake R


Hi, I am new to the drake R package and would like to hear opinions on best practices for using subplans to manage a large project. A simplified version of my project has two parts: 1) data cleaning and 2) modeling. They are sequential: I do the data cleaning first, and once I start the modeling part I rarely go back.

I think the approach suggested by the manual is:

library(drake)
source("functions_1.R") # for plan1
plan1 <- drake_plan(
    # many middle steps to create
    foo = some_function(),
    foo_1 = fn_1(foo),
    foo_2 = fn_2(foo_1),
    for_analysis = data_cleaning_fn()
)
plan2 <- drake_plan(
    # I would like to use the target name foo_1 again, but for a different object than the one defined in plan1.
    # What I want:
    # foo_1 = fn_new_1(for_analysis) # this is different from the foo_1 defined above
    # result = model_fn(foo_1)

    # What I actually did
    foo_new_1 = fn_new_1(for_analysis), # I have to define a new name different from foo_1
    result = model_fn(foo_new_1)
)
fullplan <- bind_plans(plan1, plan2)
make(fullplan)

One problem I had in the above workflow is that I have a lot of intermediate targets defined for plan1, but they are useless in plan2.

  1. Is there a way to get a "clean namespace" in plan2, so that I can drop the useless names foo_1, foo_2, etc. and reuse them in plan2? The only target I want to keep in plan2 is for_analysis.
  2. Is there a way to use the functions defined in functions_1.R only for plan1 and the functions defined in functions_2.R only for plan2? I would like to work with a smaller set of functions at a time.

Thank you a lot!


Solution

  • Interesting question. drake does not support multiple namespaces in plans. All target names must be unique and all function names must be unique, so if you want to reuse names, you would need to put those plans in separate projects altogether.

    You may be running into a situation where you are defining too many targets. Speaking broadly, targets should either (1) produce meaningful output for your project, or (2) eat up enough runtime so that skipping them saves you time. I recommend reading https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets. To condense multiple targets into one, I recommend composing functions together. Example:

    foo_all <- function() {
      # Each middle step is super quick, but all put together,
      # they take up noticeable runtime.
      foo <- some_function()
      foo_1 <- fn_1(foo)
      foo_2 <- fn_2(foo_1)
      # The value of the last expression is returned and becomes the target.
      data_cleaning_fn()
    }
    
    plan1 <- drake_plan(
      for_analysis = foo_all()
    )
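
    Wrapping the steps in functions also gives you the "clean namespace" you asked about, because variables assigned inside a function are local to it. Here is a minimal base R sketch of that idea. The names clean_all() and model_all() and their toy bodies are hypothetical stand-ins for your own helpers, not part of drake:

```r
# Hypothetical stand-in for the data cleaning steps.
clean_all <- function(raw) {
  foo_1 <- raw[!is.na(raw)]  # local to clean_all()
  foo_2 <- foo_1 * 2         # local to clean_all()
  foo_2                      # only the final value leaves the function
}

# Hypothetical stand-in for the modeling steps.
model_all <- function(for_analysis) {
  # A different foo_1, local to model_all(); no clash with the one above.
  foo_1 <- for_analysis - mean(for_analysis)
  sum(foo_1^2)
}

for_analysis <- clean_all(c(1, NA, 3))
result <- model_all(for_analysis)

print(for_analysis)
print(result)
# Neither foo_1 nor foo_2 exists outside the two functions.
print(exists("foo_1", inherits = FALSE))
```

    In a plan, for_analysis = clean_all(...) and result = model_all(for_analysis) would then be the only two names that have to be unique.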
    

    Also, drake's branching mechanisms are a convenient way to automatically generate names or avoid having to think about names too hard. Maybe have a look at https://books.ropensci.org/drake/static.html and https://books.ropensci.org/drake/dynamic.html.
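
    As a rough sketch of static branching, the plan below lets drake generate the target names from a grouping variable instead of you naming each one by hand. The second argument k of model_fn() is a hypothetical tuning parameter invented for this illustration, and foo_all() is the wrapper function from the example above:

```r
library(drake)

plan <- drake_plan(
  for_analysis = foo_all(),
  # One "result" target is created per value of k (hypothetical parameter).
  # drake derives each target name from "result" and the value of k.
  result = target(
    model_fn(for_analysis, k),
    transform = map(k = c(1, 2))
  )
)

# Inspect the generated target names and commands before calling make().
print(plan)
```

    Building the plan does not run any commands, so you can check the generated names before calling make(plan).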