r, drake-r-package

drake static branching: How to use .id within map() to increase visibility of dependency graph


I am using a drake workflow to process ~100 files that are stored in a directory with a very long path. The long path ends up in every target name, which makes the dependency graph hard to read. Here is a minimal example:

# example setup
library(drake)
very_long_path <- "this_is_a_very_long_file_path_which_makes_the_dependency_graph_hard_to_read"
dir.create(very_long_path)
filenames <- paste0("file_", seq(4), ".csv")
for (file in filenames) {
    file.create(file.path(very_long_path, file))
}
files <- list.files(very_long_path, full.names = TRUE)
ids <- rlang::syms(filenames)

# my drake plan
plan <- drake_plan(
    raw = target(
        readLines(file_in(!!file)),
        transform = map(file = !!files)
    )
)
plan

## A tibble: 4 x 2
#  target                                           command                                              
#  <chr>                                            <expr>                                               
#1 raw_this_is_a_very_long_file_path_which_makes_t~ readLines(file_in("this_is_a_very_long_file_path_whic~
#2 raw_this_is_a_very_long_file_path_which_makes_t~ readLines(file_in("this_is_a_very_long_file_path_whic~
#3 raw_this_is_a_very_long_file_path_which_makes_t~ readLines(file_in("this_is_a_very_long_file_path_whic~
#4 raw_this_is_a_very_long_file_path_which_makes_t~ readLines(file_in("this_is_a_very_long_file_path_whic~

vis_drake_graph(drake_config(plan)) ## very hard to read

[Image: unreadable dependency graph]

I've read the following about .id in ?transformations:

Symbol or vector of symbols naming grouping variables to incorporate into target names. Useful for creating short target names. Set .id = FALSE to use integer indices as target name suffixes.

That is why I created ids in the code above: to provide short names for the targets. However, changing the plan as follows did not help:

plan <- drake_plan(
    raw = target(
        readLines(file_in(!!file)),
        transform = map(file = !!files,
                        .id = !!ids)
    )
)
plan

## A tibble: 4 x 2
#  target                                           command                                              
#  <chr>                                            <expr>                                               
#1 raw_this_is_a_very_long_file_path_which_makes_t~ readLines(file_in("this_is_a_very_long_file_path_whic~
#2 raw_this_is_a_very_long_file_path_which_makes_t~ readLines(file_in("this_is_a_very_long_file_path_whic~
#3 raw_this_is_a_very_long_file_path_which_makes_t~ readLines(file_in("this_is_a_very_long_file_path_whic~
#4 raw_this_is_a_very_long_file_path_which_makes_t~ readLines(file_in("this_is_a_very_long_file_path_whic~

From my understanding, ids is a vector of symbols, so I do not understand why this is not working. What am I missing? Is that even possible?


I also tried to insert ids as a character vector, without success. I know that I can set .id = FALSE to simply enumerate the elements of raw, but I really want to keep the file names.


Solution

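  • You are very close. As the documentation you quoted says, .id expects the names of grouping variables, not the values themselves. All you need to do is register ids as a grouping variable inside map() and then pass that grouping variable's symbol to .id.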

    library(drake)
    very_long_path <- "this_is_a_very_long_file_path_which_makes_the_dependency_graph_hard_to_read"
    dir.create(very_long_path)
    
    filenames <- paste0("file_", seq(4), ".csv")
    
    for (file in filenames) {
      file.create(file.path(very_long_path, file))
    }
    
    files <- list.files(very_long_path, full.names = TRUE)
    ids <- rlang::syms(filenames)
    
    plan <- drake_plan(
      raw = target(
        read.csv(file_in(!!file)),
        transform = map(
          file = !!files,
          id_var = !!ids, # Register the grouping variable.
          .id = id_var    # Use the existing grouping variable.
        )
      )
    )
    
    plan
    #> # A tibble: 4 x 2
    #>   target        command                                                         
    #>   <chr>         <expr>                                                          
    #> 1 raw_file_1.c… read.csv(file_in("this_is_a_very_long_file_path_which_makes_the…
    #> 2 raw_file_2.c… read.csv(file_in("this_is_a_very_long_file_path_which_makes_the…
    #> 3 raw_file_3.c… read.csv(file_in("this_is_a_very_long_file_path_which_makes_the…
    #> 4 raw_file_4.c… read.csv(file_in("this_is_a_very_long_file_path_which_makes_the…
    
    plan$target
    #> [1] "raw_file_1.csv" "raw_file_2.csv" "raw_file_3.csv" "raw_file_4.csv"
    

    Created on 2020-01-21 by the reprex package (v0.3.0)
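
As a possible variation on the plan above (a sketch, not part of the original answer): the ids can be derived from files itself, so the two vectors cannot drift out of order, and the .csv extension can be dropped from the target names. The paths below are hypothetical stand-ins for the real files; drake_plan() only constructs the plan, so the files do not need to exist for this to run. The use of basename() and tools::file_path_sans_ext() is an assumption about what a convenient short name looks like.

```r
library(drake)

# Hypothetical paths standing in for the real files.
# Nothing is read here: drake_plan() only builds the plan.
files <- file.path(
  "this_is_a_very_long_file_path_which_makes_the_dependency_graph_hard_to_read",
  paste0("file_", seq(4), ".csv")
)

# One short symbol per file, derived from `files` so the order always matches.
# basename() drops the directory, file_path_sans_ext() drops ".csv".
ids <- rlang::syms(tools::file_path_sans_ext(basename(files)))

plan <- drake_plan(
  raw = target(
    read.csv(file_in(!!file)),
    transform = map(
      file = !!files,
      id_var = !!ids, # Register the short names as a grouping variable.
      .id = id_var    # Build target names from that variable only.
    )
  )
)

plan$target
```

With these inputs the targets should be named raw_file_1 through raw_file_4, while the commands still reference the full paths. This only works if the base names are unique across the files; if two files in different directories share a base name, a longer identifier would be needed.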