My team has a setup in which we track datasets and models in DVC and use a GitLab repository to track our code and DVC metadata files. We have a job in our dev GitLab pipeline (run on each push to a merge request) whose goal is to check that the developer remembered to run dvc push to keep DVC remote storage up to date. Right now, we do this by running dvc pull on the GitLab runner, which fails with errors telling you which files (new files or latest versions of existing files) were not found in the remote.
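For context, the check job currently boils down to something like this (the job name, stage, and image are placeholders rather than our exact config):

```yaml
# Sketch of the current approach: rely on dvc pull exiting non-zero
# when objects referenced in the DVC metadata are missing from the remote.
check-dvc-pushed:
  stage: test
  image: python:3.11
  script:
    - pip install dvc   # plus the extra for the remote, e.g. dvc[s3]
    - dvc pull          # downloads everything; fails if any object is missing from the remote
```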
The downside of this approach is that we are pulling the entirety of our DVC-tracked data onto a GitLab runner: we've run into out-of-memory issues, not to mention the lengthy run time to download all that data. Since the path and md5 hash of each object are stored in the DVC metadata files, I would think that's all the information DVC needs to answer the question "is the remote storage system up to date?".
It seems like dvc status is similar to what I'm asking for, but it compares the cache or workspace against remote storage; in other words, it requires the files to actually be present on whatever filesystem is making the call. Is there some way to achieve the goal I laid out above ("inform the developer that they need to run dvc push") without pulling everything from DVC?
"It seems like dvc status is similar to what I'm asking for"
dvc status --cloud will give you a list of "new" files that haven't been pushed to the (default) remote. It won't error out on its own, though, so your CI script will need to fail based on that output.
More info: https://dvc.org/doc/command-reference/status#options
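For example, a GitLab job along these lines should work; the job name, image, and grep pattern are illustrative assumptions rather than anything official, and on a runner with an empty local cache unpushed data may show up as "missing" rather than "new", so check what your DVC version actually prints:

```yaml
# Sketch of a push check that avoids dvc pull: dvc status --cloud only
# compares the hashes recorded in the DVC metadata/cache against the remote,
# so no data is downloaded.
check-dvc-pushed:
  stage: test
  image: python:3.11
  script:
    - pip install dvc                        # plus the remote extra, e.g. dvc[s3]
    - dvc status --cloud > cloud-status.txt
    - cat cloud-status.txt
    # dvc status exits 0 even when data is missing from the remote,
    # so fail explicitly when the report lists unpushed objects.
    - |
      if grep -Eq "new:|missing:" cloud-status.txt; then
        echo "DVC remote is missing data -- run 'dvc push' and push again."
        exit 1
      fi
```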
I'd also ask everyone on the team to run dvc install, which will set up some Git hooks, including an automatic dvc push with each git push.
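For reference, that's a one-time step per clone, roughly:

```sh
# One-time, in each clone: installs DVC's Git hooks
# (post-checkout, pre-commit, and pre-push).
dvc install

# From then on, the pre-push hook runs dvc push as part of git push,
# so the remote stays in sync automatically.
git push
```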