While I know that the main use case of DVC comes after the "data engineering" part, I have written something that works quite nicely for me but is missing one feature.
So nightly I run a pipeline that (for the sake of this example) collects GitHub commit data for the past 3 days (3 days so we can fill in some recent updates). The data is written in a date-partitioned directory layout like this:
# Tracking with 'dvc add data/raw/commits/'
data/raw/commits/2024/09/01/data.json
data/raw/commits/2024/09/02/data.json
data/raw/commits/2024/09/03/data.json
...
My initial run collected the whole year's data, and my commits.dvc file said nfiles: 184. But when my nightly run collects the past 3 days, then runs dvc add and dvc push, I am left with a commits.dvc file that tracks only those 3 recent files.
Is there a way that I can incrementally add files into a tracked directory without pulling the whole history from my remote every time I collect new data?
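Simplified, the nightly job currently does something like this (collect_commits.py is a placeholder for my actual collection script):

# collect the past 3 days into the partitioned layout
python collect_commits.py --days 3 --out data/raw/commits/
# re-track and upload; this step rewrites commits.dvc to match only what is on disk
dvc add data/raw/commits/
dvc push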
DVC does allow "granular" dataset updates, specifically for scenarios like this where it would be time-consuming and painful to pull the whole dataset just to update a few files.
The documentation page for this is here - Modifying Large Datasets.
Basically, in this particular case, I think we can do:
dvc add data/raw/commits/2024/09/01/data.json
or
dvc add data/raw/commits/2024/09/01
(note that we specify a path inside the tracked directory, which is data/raw/commits/ in this case)
to add a subdirectory or a few files.
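So, assuming your DVC version supports granular adds into tracked directories (see the page linked above), the nightly job could become something like the sketch below; the script name and the exact dates are illustrative:

# collect only the new partitions
python collect_commits.py --days 3 --out data/raw/commits/
# add just the updated day partitions; commits.dvc is updated in place,
# without needing the rest of the dataset locally
dvc add data/raw/commits/2024/09/01 data/raw/commits/2024/09/02 data/raw/commits/2024/09/03
# upload only the new objects, then commit the updated .dvc file
dvc push
git add data/raw/commits.dvc
git commit -m "nightly commits update"

Since dvc push only uploads cache objects that are missing from the remote, there should be no need to dvc pull the whole history first.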