r drake-r-package

Workflow for drake plan with increasing input


I have a drake plan that declares an input folder with file_in(). It reads each file in the folder, applies a number of transformations, and finally joins the results.

If I add a new file, I would like the plan to run the calculations only for that file and then join the output to the previous results. Instead, drake detects that the folder target changed and recalculates every target that depends on it.

Note: The number of files is quite large (several thousand), and the calculations are heavy.

Solution (see landau's answer below for a better approach)

This solution builds on the answer I marked as accepted, which says:

Any file or directory you declare with file_in() or target(format = "file") is treated as an irreducible unit of data, and this behavior in drake will not change in future development. But you can split up the files among multiple targets so some targets stay up to date if a file changes.

library(drake)

drake_plan(
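  # full.names = TRUE returns complete paths that format = "file" can track.
  input = target(
    list.files(file_in("/path/to/folder"), full.names = TRUE),
    format = "file"
  ),
  target1 = target(do_stuff1(input), dynamic = map(input))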
)

This creates one dynamic sub-target per file. A new file adds a new sub-target, while the existing sub-targets stay up to date and are not recalculated.


Solution

  • Any file or directory you declare with file_in() or target(format = "file") is treated as an irreducible unit of data, and this behavior in drake will not change in future development. But you can split up the files among multiple targets so some targets stay up to date if a file changes.

    library(drake)
    group1 <- c("file1", "file2")
    group2 <- c("file3", "file4")
    drake_plan(
      target1 = do_stuff(file_in(!!group1)),
      target2 = do_stuff(file_in(!!group2))
    )
    #> # A tibble: 2 x 2
    #>   target  command                               
    #>   <chr>   <expr_lst>                            
    #> 1 target1 do_stuff(file_in(c("file1", "file2")))
    #> 2 target2 do_stuff(file_in(c("file3", "file4")))
    

    Created on 2020-09-04 by the reprex package (v0.3.0)

    With dynamic branching

    Dynamic branching over files is trickier, and file_in() is for static targets only. Even so, creating a dynamic sub-target for every single file may be suboptimal because you have thousands of them. It is probably better to batch files into groups and give each group to a sub-target. But if you still want to dynamically branch over every single file, here is a way to do it that ensures each file is reproducibly tracked for changes.

    library(drake)
    drake_plan(
      # Always run to get the latest set of file paths.
      untracked_files = target(
        list.files("directory_with_files", full.names = TRUE),
        trigger = trigger(condition = TRUE)
      ),
      # Map over the vector of file paths and reproducibly track each one.
      # format = "file" makes drake watch each file itself, not just its path.
      tracked_files = target(
        untracked_files,
        format = "file",
        dynamic = map(untracked_files)
      ),
      # Map over the tracked files and analyze each one.
      analyses = target(
        do_stuff(tracked_files),
        dynamic = map(tracked_files)
      )
    )
    #> # A tibble: 3 x 4
    #>   target       command                          trigger           dynamic       
    #>   <chr>        <expr_lst>                       <expr_lst>        <expr_lst>    
    #> 1 untracked_f… list.files("directory_with_file… trigger(conditio… NA           …
    #> 2 tracked_fil… untracked_files                … NA              … map(untracked…
    #> 3 analyses     do_stuff(tracked_files)        … NA              … map(tracked_f…
    

    Created on 2020-09-17 by the reprex package (v0.3.0)
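The batching suggestion above can be sketched in base R. The helper name assign_batch(), the example paths, and the bucketing rule (a crude sum of character codes) are all assumptions made for illustration and are not part of drake. The idea is to derive each file's batch from its own name only, so adding a file never moves existing files into a different batch, and only the batch that receives the new file goes out of date.

```r
# Hypothetical helper (not a drake function): map each path to one of
# `n_batches` buckets using only the file's own base name.
assign_batch <- function(paths, n_batches) {
  vapply(
    basename(paths),
    # Sum the character codes of the name and fold into 1..n_batches.
    function(name) sum(utf8ToInt(name)) %% n_batches + 1,
    numeric(1),
    USE.NAMES = FALSE
  )
}

# Placeholder paths; in practice these would come from list.files().
paths <- c("data/a.csv", "data/b.csv", "data/c.csv", "data/d.csv")
batches <- split(paths, assign_batch(paths, n_batches = 2))

# Adding a file changes only the batch that the new file falls into.
more_paths <- c(paths, "data/e.csv")
more_batches <- split(more_paths, assign_batch(more_paths, n_batches = 2))
```

Each element of batches could then be given to its own static target with file_in(), as in the first example. Assuming drake's dynamic group() transformation with its .by argument, the batch index might also serve as the grouping variable, although that is not shown here. Assigning batches by position (for example, every 100 files in sorted order) would be simpler, but a file inserted in the middle of the ordering would shift every later batch.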

    Dynamic branching over files is slightly easier in the targets package, drake's successor, thanks to tarchetypes::tar_files().
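For comparison, a minimal sketch of the same workflow as a _targets.R file. The body of do_stuff() is a placeholder assumption, and the directory name is carried over from the drake example. tar_files() creates a pair of targets: an upstream one that always reruns to list the paths, and a downstream one that tracks each file as its own dynamic branch.

```r
# _targets.R (a sketch under the assumptions stated above)
library(targets)
library(tarchetypes)

# Placeholder analysis function; replace with the real per-file computation.
do_stuff <- function(path) {
  nrow(read.csv(path))
}

pipeline <- list(
  # List the files on every run and track each one individually.
  tar_files(
    input_files,
    list.files("directory_with_files", full.names = TRUE)
  ),
  # One branch per file; existing branches are skipped when a file is added.
  tar_target(
    analyses,
    do_stuff(input_files),
    pattern = map(input_files)
  )
)

# A _targets.R file must end with the list of targets.
pipeline
```

Running targets::tar_make() in the project directory should then build only the branches for new or changed files.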