[SOLVED] Incremental add with DVC tracked directories

While I know that the main use case of DVC comes after the "data engineering" parts, I have written something that works quite nicely for me but is missing one feature.

Every night I run a pipeline that (for the sake of this example) collects GitHub commit data for the past 3 days (3 days so that recent updates get filled in). The data is written in a date-partitioned directory format like this:

# Tracking with 'dvc add data/raw/commits/'
data/raw/commits/2024/09/01/data.json
data/raw/commits/2024/09/02/data.json
data/raw/commits/2024/09/03/data.json
...

My initial run collected the whole year's data, and my commits.dvc file said nfiles: 184. But when my nightly run starts, collects the past 3 days, and runs dvc add and dvc push, I am left with a commits.dvc file that tracks only those 3 recent files.

Is there a way that I can incrementally add files to a tracked directory without pulling the whole history from my remote every time I collect new data?

Solution

DVC does allow for "granular" dataset updates, specifically for scenarios like this one, where it would be time-consuming and painful to pull the whole dataset just to update a few files.

This is covered in the DVC documentation page Modifying Large Datasets.

Basically, in this particular case, I think we can run either of the following, to add a few files or a whole subdirectory respectively:

dvc add data/raw/commits/2024/09/01/data.json

dvc add data/raw/commits/2024/09/01

(Note that we specify a path inside the tracked directory, which is data/raw/commits/ in this case.)
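For the nightly job from the question, the granular add could be wrapped in a small script along these lines. This is only a sketch under a few assumptions: GNU date is available, the workspace contains just the freshly collected partitions, and the names DATA_ROOT, DAYS, DRY_RUN, recent_partitions and run are made up for illustration. It prints the commands by default instead of running them.

```shell
#!/bin/sh
# Sketch: granular "dvc add" for the last few date partitions.
# Assumes GNU date; all variable and function names here are hypothetical.
set -eu

DATA_ROOT="data/raw/commits"   # the DVC-tracked directory from the question
DAYS=3                         # how many recent days the nightly run collects
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print commands, 0 = execute them

# Print one partition directory per line for the rp_days days ending
# at the given date (YYYY-MM-DD), oldest first.
recent_partitions() {
  rp_end="$1"
  rp_days="$2"
  rp_i=$((rp_days - 1))
  while [ "$rp_i" -ge 0 ]; do
    echo "$DATA_ROOT/$(TZ=UTC date -d "$rp_end $rp_i days ago" +%Y/%m/%d)"
    rp_i=$((rp_i - 1))
  done
}

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Add each recent partition by its path inside the tracked directory,
# then upload the new files to the remote.
recent_partitions "$(date +%Y-%m-%d)" "$DAYS" | while IFS= read -r dir; do
  run dvc add "$dir"
done
run dvc push
```

Running it with DRY_RUN=0 executes the commands for real. After that, the updated commits.dvc file still has to be committed to Git as usual.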