I started to use {drake} for a data production pipeline. The raw data I work with is quite large and is split up into ~130 separate (Stata) files, so each file should be processed separately. To keep the plan readable, I use target(), transform() and map() to specify it. It looks similar to the code below:
plan <- drake_plan(
  dta_paths = list.files(my_folder, full.names = TRUE),
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)
When I make() the plan, I get the following error:
target dfs_dta_paths
Warning: target dfs_dta_paths warnings:
the condition has length > 1 and only the first element will be used
the condition has length > 1 and only the first element will be used
the condition has length > 1 and only the first element will be used
fail dfs_dta_paths
Error: Target dfs_dta_paths failed. Call diagnose(dfs_dta_paths) for details. Error message:
  Expecting a single string value: [type=character; extent=129].
As far as I understand the warnings and the error message, the mapping over the file paths is not working and the full vector is passed to a single function call. I read https://books.ropensci.org/drake/static.html#map but it did not help me figure out the problem. Converting the vector of paths to a list did not help either.
From "How to combine multiple drake targets into a single cross target without combining the datasets?" I got the idea of predefining a grid, which works as suggested. But since I only need a vector, not a complex grid, this looks like over-engineering to me.
I feel like I'm missing something obvious, but I can't spot it. Any ideas what's wrong with my code?
I am aware of https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets, but since I want to iterate in the process of data cleaning, I thought it would be helpful to create the dfs target as shown above.
When you use target(transform = ...), it is always best to inspect the plan before you feed it to make(). It can take a couple of iterations to get it right. Here is what your current plan looks like: a single dfs_dta_paths target whose command receives the whole vector of paths.
library(drake)
plan <- drake_plan(
  dta_paths = list.files(my_folder, full.names = TRUE),
  dfs = target(
    read.dta13(dta_path),
    transform = map(dta_path = dta_paths)
  )
)
plan
#> # A tibble: 2 x 2
#>   target        command
#>   <chr>         <expr>
#> 1 dta_paths     list.files(my_folder, full.names = TRUE)
#> 2 dfs_dta_paths read.dta13(dta_paths)
config <- drake_config(plan)
vis_drake_graph(config)
Created on 2020-01-16 by the reprex package (v0.3.0)
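The plan above contains the symbol dta_paths rather than the individual file paths. The sketch below illustrates that difference outside of drake, using rlang::expr() as a stand-in for the way a quoted plan captures code. It is only an illustration of quoting versus unquoting with !!, not drake's internal implementation, and the two paths are made up for the example.

```r
library(rlang)

# Hypothetical paths, used only for illustration.
dta_paths <- c("folder/file1.dta", "folder/file2.dta")

# Without !!, the quoted call keeps the bare symbol `dta_paths`.
quoted <- expr(map(dta_path = dta_paths))

# With !!, the current value of `dta_paths` is inserted into the call.
unquoted <- expr(map(dta_path = !!dta_paths))

# Pull the `dta_path` argument out of each quoted call.
arg_quoted <- as.list(quoted)[["dta_path"]]
arg_unquoted <- as.list(unquoted)[["dta_path"]]

is.symbol(arg_quoted)       # TRUE: just a name, no values yet
is.character(arg_unquoted)  # TRUE: the actual vector of paths
length(arg_unquoted)        # 2: one entry per file
```

With only a symbol to work with, a static transformation has nothing to iterate over, which is consistent with the single dfs_dta_paths target shown in the plan.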
To read one file per target, I recommend the plan below. See https://books.ropensci.org/drake/static.html#tidy-evaluation for more on why it uses !!.
library(drake)
library(readstata13) # provides read.dta13(), which the commands call when make() runs
# create some faux stata files for the example.
my_folder <- fs::dir_create("folder")
file.create("folder/file1.dta")
#> [1] TRUE
file.create("folder/file2.dta")
#> [1] TRUE
# Since you are using static branching (https://books.ropensci.org/drake/static.html)
# this needs to be defined up front.
# It does not need to be a target, re https://books.ropensci.org/drake/plans.html#how-to-choose-good-targets
dta_paths <- list.files(my_folder, full.names = TRUE)
plan <- drake_plan(
  dfs = target(
    # Use !! here to literally insert the path so file_in() can mark it for tracking.
    read.dta13(file_in(!!dta_path)),
    # Use !! here to insert the actual vector of paths instead of the symbol `dta_paths`
    transform = map(dta_path = !!dta_paths)
  )
)
plan
#> # A tibble: 2 x 2
#>   target               command
#>   <chr>                <expr>
#> 1 dfs_folder.file1.dta read.dta13(file_in("folder/file1.dta"))
#> 2 dfs_folder.file2.dta read.dta13(file_in("folder/file2.dta"))
config <- drake_config(plan)
vis_drake_graph(config)
Created on 2020-01-16 by the reprex package (v0.3.0)
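Since it can take a few iterations to get a transformation right, a small programmatic check before calling make() may be useful in addition to eyeballing the plan. The sketch below rebuilds the plan with two hypothetical paths and asserts that there is one target per file. The check itself is my own suggestion rather than something drake requires.

```r
library(drake)

# Hypothetical paths; the files do not need to exist just to build the plan.
dta_paths <- c("folder/file1.dta", "folder/file2.dta")

plan <- drake_plan(
  dfs = target(
    read.dta13(file_in(!!dta_path)),
    transform = map(dta_path = !!dta_paths)
  )
)

# Turn each command into a single string so it can be searched.
commands <- vapply(
  plan$command,
  function(cmd) paste(deparse(cmd), collapse = " "),
  character(1)
)

# Each path should appear in exactly one command.
hits <- vapply(
  dta_paths,
  function(p) sum(grepl(p, commands, fixed = TRUE)),
  numeric(1)
)

stopifnot(nrow(plan) == length(dta_paths))
stopifnot(all(hits == 1))
```

If either stopifnot() fails, the transformation most likely received a symbol or the whole vector instead of individual paths, as in the original plan.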