r drake-r-package

How do I read dynamic files in drake?


I want to use drake's dynamic targets to read multiple files. I wrote the following plan based on my understanding of how dynamic files work. However, when one of the input files changes, drake does not update the targets that depend on it.

What is the correct way to use drake's dynamic files to read files?

In other words, what is the dynamic-files equivalent of file_in() for the problem described in "How can I import from multiple files in r-drake?"

library(drake)
library(tidyverse)

content <- tibble(x1 = 1, x2 = 1)
walk(list("a", "b"), ~ write_csv(x = content, path = paste0(., ".csv")))
read_csv("b.csv", col_types = "dd")
#> # A tibble: 1 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1

plan <- drake::drake_plan(
  import_paths = target(c(
    a = "a.csv",
    b = "b.csv"
  ),
  format = "file"
  ),

  data = target(
    read_csv(import_paths, col_types = "dd"),
    dynamic = map(import_paths)
  )
)

drake::make(plan)
#> ▶ target import_paths
#> ▶ dynamic data
#> > subtarget data_44119303
#> > subtarget data_ecc6ebe6
#> ■ finalize data
readd(data)
#> # A tibble: 2 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     1     1

walk(list("b"), ~ write_csv(x = content + 1, path = paste0(., ".csv")))
read_csv("b.csv", col_types = "dd")
#> # A tibble: 1 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     2     2

drake::make(plan)
#> ▶ target import_paths
#> ■ finalize data
readd(data)
#> # A tibble: 2 x 2
#>      x1    x2
#>   <dbl> <dbl>
#> 1     1     1
#> 2     1     1

Created on 2020-08-06 by the reprex package (v0.3.0)


Solution

  • Perhaps this is not obvious, but dynamic file targets are irreducible. If c("a.csv", "b.csv") is your dynamic file, you cannot break it up into "a.csv" and "b.csv". drake stores a global hash of all those files together, and it does not keep track of the hashes or timestamps on a file-by-file basis. This helps drake stay efficient even if you return a large number of dynamic files from a single target.

    The solution is to make "a.csv" and "b.csv" two different dynamic file targets using a dynamic map(). You need an extra target at the beginning just to contain the path names, but it gets the job done.

    library(drake)
    library(tidyverse)
    
    content <- tibble(x1 = 1, x2 = 1)
    walk(list("a", "b"), ~ write_csv(x = content, path = paste0(., ".csv")))
    read_csv("b.csv", col_types = "dd")
    #> # A tibble: 1 x 2
    #>      x1    x2
    #>   <dbl> <dbl>
    #> 1     1     1
    
    plan <- drake_plan(
      import_paths = c("a.csv", "b.csv"),
      import_files = target(
        import_paths,
        format = "file",
        dynamic = map(import_paths)
      ),
      data = target(
        read_csv(import_files, col_types = "dd"),
        dynamic = map(import_files)
      )
    )
    
    make(plan)
    #> ▶ target import_paths
    #> ▶ dynamic import_files
    #> > subtarget import_files_4209ea92
    #> > subtarget import_files_b8419eb2
    #> ■ finalize import_files
    #> ▶ dynamic data
    #> > subtarget data_b59aea49
    #> > subtarget data_e6b8ef3e
    #> ■ finalize data
    
    readd(data)
    #> # A tibble: 2 x 2
    #>      x1    x2
    #>   <dbl> <dbl>
    #> 1     1     1
    #> 2     1     1
    
    walk(list("b"), ~ write_csv(x = content + 1, path = paste0(., ".csv")))
    read_csv("b.csv", col_types = "dd")
    #> # A tibble: 1 x 2
    #>      x1    x2
    #>   <dbl> <dbl>
    #> 1     2     2
    
    make(plan)
    #> ▶ dynamic import_files
    #> > subtarget import_files_b8419eb2
    #> ■ finalize import_files
    #> ▶ dynamic data
    #> > subtarget data_a0f1c4f0
    #> ■ finalize data
    
    readd(data)
    #> # A tibble: 2 x 2
    #>      x1    x2
    #>   <dbl> <dbl>
    #> 1     1     1
    #> 2     2     2
    

    Created on 2020-08-06 by the reprex package (v0.3.0)
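
To illustrate why a single combined hash cannot tell you which file changed, here is a minimal sketch in base R. It is not drake's internal code: the helper `global_fingerprint()` and the temporary directory are assumptions made up for the illustration, and `tools::md5sum()` merely stands in for whatever hashing drake actually uses.

```r
# Illustration only: NOT drake's implementation.
# Create two small files in a temporary directory.
dir <- tempfile("dynfiles_")
dir.create(dir)
paths <- file.path(dir, c("a.csv", "b.csv"))
writeLines(c("x1,x2", "1,1"), paths[1])
writeLines(c("x1,x2", "1,1"), paths[2])

# Hypothetical helper: one fingerprint for the whole set of files,
# analogous to tracking c("a.csv", "b.csv") as a single dynamic file.
global_fingerprint <- function(paths) {
  paste(unname(tools::md5sum(paths)), collapse = "")
}

# Record both kinds of fingerprint before the change.
global_before <- global_fingerprint(paths)
per_file_before <- unname(tools::md5sum(paths))

# Modify only b.csv.
writeLines(c("x1,x2", "2,2"), paths[2])

global_after <- global_fingerprint(paths)
per_file_after <- unname(tools::md5sum(paths))

# The combined fingerprint only says that *something* changed ...
global_changed <- !identical(global_before, global_after)

# ... whereas per-file fingerprints identify the file that changed.
changed_files <- basename(paths[per_file_before != per_file_after])
```

Mapping over the paths with `dynamic = map(import_paths)`, as in the plan above, plays the role of the per-file fingerprints: each sub-target tracks exactly one file, so only the sub-targets downstream of the changed file need to rerun.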