
Dealing with zip files in a targets workflow


I'm trying to set up a workflow that involves downloading a zip file, extracting its contents, and applying a function to each one of its files.

There are a few issues I'm running into:

  1. How do I set up an empty file system reproducibly? Namely, I'm hoping to create a system of empty directories into which files will later be downloaded. Ideally, I'd like to do something like tar_target(my_dir, fs::dir_create("data"), format = "file"), but I know from the documentation that empty directories cannot be used with format = "file". I could just call dir_create() every time I need the directory, but this seems clumsy.

  2. In the reprex below I'd like to operate individually on each file using pattern = map(x). As the error suggests, I'd need to specify a pattern for the parent target, since format = "file". You can see that if I did specify a pattern for the parent target, I would again need to do it for its parent target. As far as I know, a pattern cannot be set for a target that has no parents (but I have been wrong many times before).

I have a feeling I'm going about this all wrong - thank you for your time.
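For issue 1, one possible workaround (a sketch, not from the original post): fs::dir_create() is idempotent, so it is safe to call inside every target command that needs the directory, and to track the resulting files rather than the directory itself. The ensure_dest() helper below is hypothetical.

    library(fs)

    # Hypothetical helper: make sure the directory exists, then return the
    # destination *file* path so the file (not the directory) can be tracked
    # with format = "file". dir_create() is a no-op if "data" already exists.
    ensure_dest <- function(dir, file) {
      fs::dir_create(dir)
      fs::path(dir, file)
    }

    # Used inside a target command, e.g.:
    # tar_target(downloaded_zip,
    #            download_file(url, ensure_dest("data", "file.zip")),
    #            format = "file")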

library(targets)
tar_script({
    tar_option_set(packages = c("tidyverse", "fs"))
    download_file <- function(url, dest) {
        download.file(url, dest)
        dest
    }
    do_stuff <- function(file_path) {
        fs::file_copy(file_path, file_path, overwrite = TRUE)
    }
    list(
      tar_target(downloaded_zip, 
                 download_file("https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip", 
                               path(dir_create("data"), "file", ext = "zip")), 
                 format = "file"), 
 
      tar_target(extracted_files, 
                 unzip(downloaded_zip, exdir = dir_create("data")), 
                 format = "file"), 

      tar_target(stuff_done, 
                 do_stuff(extracted_files), 
                 pattern = map(extracted_files), format = "file", 
                 iteration = "list"))
})
tar_make()
#> * start target downloaded_zip
#> trying URL 'https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip'
#> Content type 'application/zip' length 2036861 bytes (1.9 MB)
#> ==================================================
#> downloaded 1.9 MB
#> 
#> * built target downloaded_zip
#> * start target extracted_files
#> * built target extracted_files
#> * end pipeline
#> Error : Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Error: callr subprocess failed: Target stuff_done tried to branch over extracted_files, which is illegal. Patterns must only branch over explicitly declared targets in the pipeline. Stems and patterns are fine, but you cannot branch over branches or global objects. Also, if you branch over a target with format = "file", then that target must also be a pattern.
#> Visit https://books.ropensci.org/targets/debugging.html for debugging advice.

Created on 2021-12-08 by the reprex package (v2.0.1)


Solution

  • Original answer

    Here's an idea: you could track that URL with format = "url" and then make the URL a dependency of all the file branches. Below, all of the file branches rerun when the upstream online data changes. That's fine because all it does is re-hash the files. Not every branch of stuff_done reruns, though: only the branches whose files actually changed.

    Edit

    On second thought, we probably need to hash the local files in bulk first. It's not the most efficient approach, but it gets the job done. targets wants you to use its own built-in storage system instead of external files, so if you can read the data in and return it in a non-file format, dynamic branching will be easier.

    # _targets.R file
    library(targets)
    tar_option_set(packages = c("tidyverse", "fs"))
    download_file <- function(url, dest) {
      download.file(url, dest)
      dest
    }
    do_stuff <- function(file_path) {
      file.info(file_path)
    }
    download_and_unzip <- function(url) {
      downloaded_zip <- tempfile()
      download_file(url, downloaded_zip)
      unzip(downloaded_zip, exdir = dir_create("data"))
    }
    list(
      tar_target(
        url,
        "https://file-examples-com.github.io/uploads/2017/02/zip_2MB.zip",
        format = "url"
      ),
      tar_target(
        files_bulk,
        download_and_unzip(url),
        format = "file"
      ),
      tar_target(file_names, files_bulk), # not a format = "file" target
      tar_target(
        files, {
          files_bulk # Re-hash all the files separately if any file changes.
          file_names
        },
        pattern = map(file_names),
        format = "file"
      ),
      tar_target(stuff_done, do_stuff(files), pattern = map(files))
    )
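
    With a _targets.R file like the one above, the pipeline can be run and inspected as usual (a sketch of an interactive session; output not shown):

        library(targets)
        tar_make()            # builds url, files_bulk, file_names, files, stuff_done
        tar_read(stuff_done)  # combined file.info() results across branches
        tar_outdated()        # should be empty right after a successful build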