git versioning dvc

Incremental add with DVC tracked directories


While I know that DVC's main use case comes after the "data engineering" stage, I have written something that works quite nicely for me but is missing one feature.

Every night I run a pipeline that (for the sake of this example) collects GitHub commit data for the past 3 days (3 days, so that recent updates get filled in). The data is written in a date-partitioned directory layout like this:

# Tracking with 'dvc add data/raw/commits/'
data/raw/commits/2024/09/01/data.json
data/raw/commits/2024/09/02/data.json
data/raw/commits/2024/09/03/data.json
...

My initial run collected the whole year's data, and my commits.dvc file said nfiles: 184. But when my nightly run starts, collects the past 3 days, and then runs dvc add and dvc push, I am left with a commits.dvc file that tracks only those 3 recent files.

Is there a way to incrementally add files to a tracked directory without pulling the whole history from my remote every time I collect new data?


Solution

  • DVC does allow "granular" dataset updates, specifically for scenarios like this one, where it would be time-consuming and painful to pull the whole dataset just to update a few files.

    The documentation page for this is "Modifying Large Datasets".

    Basically, in this particular case, I think we can add a single file with:

    dvc add data/raw/commits/2024/09/01/data.json

    or a whole subdirectory with:

    dvc add data/raw/commits/2024/09/01

    (Mind that we specify a path inside the tracked directory, which is data/raw/commits/ in this case.)
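For the nightly job in the question, the same idea could be wrapped in a small script. This is only a rough sketch, not something taken from the DVC docs: the helper names (partition_for, recent_partitions, nightly_add) and the ROOT and DVC variables are made up for illustration, it assumes GNU date for the "N days ago" arithmetic, and it assumes a DVC version that supports the granular update described above.

```shell
#!/usr/bin/env bash
# Hypothetical nightly helper: re-add only the partitions the pipeline just rewrote.

ROOT="${ROOT:-data/raw/commits}"   # directory tracked by commits.dvc (from the question)
DVC="${DVC:-dvc}"                  # set DVC="echo dvc" for a dry run (illustrative knob)

# Print the partition directory for N days ago (assumes GNU date).
partition_for() {
  date -u -d "$1 days ago" +"$ROOT/%Y/%m/%d"
}

# Today plus the two previous days, matching the 3-day window in the question.
recent_partitions() {
  for n in 0 1 2; do
    partition_for "$n"
  done
}

# Add each partition that exists locally, then push the new objects.
nightly_add() {
  for p in $(recent_partitions); do
    if [ -d "$p" ]; then
      # Path inside the tracked directory, as in the commands above.
      $DVC add "$p" || return 1
    fi
  done
  $DVC push
  # Afterwards, commit the updated commits.dvc file with git as usual.
}
```

Calling nightly_add at the end of the pipeline would then run dvc add only on the freshly written date directories. Running it once with DVC="echo dvc" prints the commands without executing them, which is a cheap way to check the paths first.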