R {drake} plan: Read many datasets into single target

I have started using {drake} for a data production pipeline. The raw data I work with is quite large and is split into ~130 separate Stata files, so each file should be processed separately. To keep the plan readable, I use target() with transform = map() to specify it. It looks similar to the code below:

plan <- drake_plan(
    dta_paths = list.files(my_folder, full.names = TRUE),
    dfs = target(
        read.dta13(dta_path),
        transform = map(dta_path = dta_paths)
    )
)

When I make() the plan, I get the following warnings and error:

target dfs_dta_paths

Warning: target dfs_dta_paths warnings:

the condition has length > 1 and only the first element will be used

the condition has length > 1 and only the first element will be used

the condition has length > 1 and only the first element will be used

fail dfs_dta_paths

Error: Target dfs_dta_paths failed. Call diagnose(dfs_dta_paths) for details. Error message:

Expecting a single string value: [type=character; extent=129].

As far as I understand the warnings and the error message, the mapping over the file paths is not working, and the full vector is passed to a single function call. I read https://books.ropensci.org/drake/static.html#map, but it did not help me figure out the problem. Converting the vector of paths to a list did not help either.

From the question "How to combine multiple drake targets into a single cross target without combining the datasets?" I got the idea of predefining a grid, which works as suggested. But since I only need a vector, not a complex grid, this looks like over-engineering to me.
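For reference, the grid approach mentioned above might look roughly like the sketch below. It assumes drake's `map(.data = !!...)` form of static branching; the data frame name `grid` and the two file paths are hypothetical stand-ins for the real folder contents.

```r
library(drake)

# Hypothetical one-column grid: one row per Stata file.
grid <- data.frame(
  dta_path = c("folder/file1.dta", "folder/file2.dta"),
  stringsAsFactors = FALSE
)

plan <- drake_plan(
  dfs = target(
    read.dta13(file_in(dta_path)),
    # !! splices the actual data frame into the transform.
    transform = map(.data = !!grid)
  )
)

# One target per row of the grid.
nrow(plan)
```

With a single column, the grid carries no more information than the vector of paths, which is why it feels heavier than necessary here.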

I feel like I'm missing something obvious, but I can't spot it. Any ideas what's wrong with my code?

I am aware of https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets, but since I want to iterate in the process of data cleaning, I thought it would be helpful to create the dfs target as shown above.

Solution

When you use target(transform = ...), it is always best to inspect and visualize the plan before you feed it to make(). It can take a couple of iterations to get it right. Here is what your current plan looks like.

library(drake)
plan <- drake_plan(
  dta_paths = list.files(my_folder, full.names = TRUE),
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)

plan
#> # A tibble: 2 x 2
#>   target        command                                 
#>   <chr>         <expr>                                  
#> 1 dta_paths     list.files(my_folder, full.names = TRUE)
#> 2 dfs_dta_paths read.dta13(dta_paths)

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2020-01-16 by the reprex package (v0.3.0)
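The effect of a bare symbol inside map() can be checked on a toy plan without running make(). This is only a sketch: the vector `paths`, the grouping variable `p`, and the target name `x` are made up for illustration.

```r
library(drake)

# Hypothetical vector of file names, defined outside the plan.
paths <- c("a.dta", "b.dta")

# map() receives the symbol `paths`, so it creates a single target.
plan_symbol <- drake_plan(
  x = target(nchar(p), transform = map(p = paths))
)

# !! splices in the values, so map() creates one target per element.
plan_values <- drake_plan(
  x = target(nchar(p), transform = map(p = !!paths))
)

nrow(plan_symbol)
nrow(plan_values)
```

Comparing the row counts of the two plans is a cheap way to confirm that the mapping expands as intended.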

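The plan contains a single target, dfs_dta_paths, whose command passes the entire dta_paths vector to read.dta13(). Static branching happens when the plan is created, before any target is built, so map() only sees the symbol dta_paths and not the file paths behind it. To read one file per target, I recommend the plan below. See https://books.ropensci.org/drake/static.html#tidy-evaluation for more on why it uses !!.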

library(drake)
library(readstata13) # provides read.dta13(), needed once make() runs

# Create some faux Stata files for the example.
my_folder <- fs::dir_create("folder")
file.create("folder/file1.dta")
#> [1] TRUE
file.create("folder/file2.dta")
#> [1] TRUE

# Since you are using static branching (https://books.ropensci.org/drake/static.html)
# this needs to be defined up front.
# It does not need to be a target, re https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets
dta_paths <- list.files(my_folder, full.names = TRUE)

plan <- drake_plan(
  dfs = target(
    # Use !! here to literally insert the path so file_in() can mark it for tracking.
    read.dta13(file_in(!!dta_path)),
    # Use !! here to insert the actual vector of paths instead of the symbol `dta_paths`
    transform = map(dta_path = !!dta_paths)
  )
)

plan
#> # A tibble: 2 x 2
#>   target               command                                
#>   <chr>                <expr>                                 
#> 1 dfs_folder.file1.dta read.dta13(file_in("folder/file1.dta"))
#> 2 dfs_folder.file2.dta read.dta13(file_in("folder/file2.dta"))

config <- drake_config(plan)
vis_drake_graph(config)

Created on 2020-01-16 by the reprex package (v0.3.0)
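The faux files above are empty, so they are enough to build the plan but not to run it. A minimal end-to-end sketch, assuming the readstata13 package is installed and using two small slices of `mtcars` as made-up stand-ins for the real data, could look like this:

```r
library(drake)
library(readstata13)

# Assumption: write two small, real Stata files so read.dta13() can parse them.
dir.create("folder", showWarnings = FALSE)
save.dta13(head(mtcars), "folder/file1.dta")
save.dta13(tail(mtcars), "folder/file2.dta")

dta_paths <- list.files("folder", pattern = "\\.dta$", full.names = TRUE)

plan <- drake_plan(
  dfs = target(
    read.dta13(file_in(dta_path)),
    transform = map(dta_path = !!dta_paths)
  )
)

# Build one target per file, then read a single target back from the cache.
make(plan)
first <- readd("dfs_folder.file1.dta", character_only = TRUE)
```

Because each path is wrapped in file_in(), editing one of the files should invalidate only the target that reads it on the next make().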