r, drake-r-package

What is the best practice for transferring objects across R projects?


I would like to use R objects (e.g., cleaned data) generated in one git-versioned R project in another git-versioned R project.

Specifically, I have multiple git-versioned R projects (that hold drake plans) that do various things for my thesis experiments (e.g., generate materials, import and clean data, generate reports/articles).

The experiment-specific projects should ideally be:

  1. Connectable - so that I can get objects (mainly data and materials) that I generated in these projects into another git-versioned R project that generates my thesis report.
  2. Self-contained - so that I can use them in other non-thesis projects (such as presentations, reports, and journal manuscripts). When sharing such projects, I would rather not have to share a monolithic thesis project.
  3. Versioned - so that their use in different projects can be independent (e.g., if I make changes to the data cleaning for a manuscript after submitting the thesis, I still want the thesis to be reproducible as it was originally compiled).

At the moment I can see three ways of doing this:

  1. Re-create the data cleaning process
    • But: this involves copy/paste, which I'd like to avoid, especially if things change upstream.
  2. Access the relevant scripts/functions by changing the working directory
    • But: even if I used the here package, it seems that this would hurt reproducibility.
  3. Make the source projects into packages and make the objects I want to "export" into exported data (as per the data section of Hadley's R packages guide)

Is there any other way of doing this?

Edit: I tried @landau's suggestion of using a single drake plan, which worked well for a while, until (similar to @vrognas' case) I ended up with too many sub-projects (e.g., conference presentations and manuscripts) that relied on the same objects. I have therefore clarified my intentions in the question above.


Solution

  • My first recommendation is to use a single drake plan to unite the stages of the overall project that need to share data. drake is designed to handle a lot of moving parts this way, and it will be more seamless when it comes to drake's decisions about what to rerun downstream. But if you really do need different plans in different sub-projects that share data, you can track each shared dataset as a file_out() file in one plan and track it with file_in() in another plan.

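    library(drake)
    library(readr) # provides write_csv() and read_csv()

    # Upstream project: file_out() tells drake that this target writes the file.
    upstream_plan <- drake_plan(
      export_file = write_csv(dataset, file_out("exported_data/dataset.csv"))
    )
    
    # Downstream project: file_in() makes the target rerun when the file changes.
    downstream_plan <- drake_plan(
      dataset = read_csv(file_in("../upstream_project/exported_data/dataset.csv"))
    )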
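
For anyone who wants to try the file handoff end to end, here is a minimal sketch under a few assumptions of mine: the toy data frame, the target name thesis_data, and the two cache directories (.drake_upstream and .drake_downstream) are hypothetical, and base R's write.csv()/read.csv() stand in for readr so that drake is the only dependency. Both plans run from a single working directory purely to keep the example self-contained; in real use each plan would live in its own project and the downstream path would point into the upstream project, as in the plans above.

```r
library(drake)

# Upstream plan: build a toy dataset and export it to a tracked file.
upstream_plan <- drake_plan(
  dataset = data.frame(id = 1:3, score = c(2.5, 3.1, 4.8)),
  export_file = {
    # Create the export directory if it does not exist yet.
    dir.create("exported_data", showWarnings = FALSE)
    write.csv(dataset, file_out("exported_data/dataset.csv"), row.names = FALSE)
  }
)

# Downstream plan: depend on the exported file rather than on upstream targets.
downstream_plan <- drake_plan(
  thesis_data = read.csv(file_in("exported_data/dataset.csv"))
)

# Separate caches mimic two independent projects (hypothetical directory names).
upstream_cache <- new_cache(path = ".drake_upstream")
downstream_cache <- new_cache(path = ".drake_downstream")

make(upstream_plan, cache = upstream_cache)
make(downstream_plan, cache = downstream_cache)

# Load the downstream target from its own cache.
result <- readd(thesis_data, cache = downstream_cache)
```

Because the downstream plan only knows about the file, rerunning make(downstream_plan, cache = downstream_cache) should rebuild thesis_data only when exported_data/dataset.csv has changed on disk.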