I am testing data tracking with DVC (Data Version Control). I went through the example at dvc.org, added the data, and committed the generated files to Git as well. Now, after I run git checkout, do I run dvc pull or dvc checkout? What do these commands do behind the scenes?
project/
    .dvc/
    train.py
    data/
        abc.csv
    data.dvc
I have initialized a Git repository, installed and initialized DVC, and added some data with the command below:
dvc add data
The dvc pull command is the same as dvc fetch + dvc checkout.
dvc fetch downloads data from the remote storage (S3, Google Cloud, etc.) into the DVC cache, while dvc checkout then "instantiates" those files in the workspace.
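To make the fetch/checkout split concrete, here is a toy model in Python. It is only a sketch of the idea and not DVC's actual implementation: the function names, the SHA-256 hashing, and the plain-directory "remote" and "cache" are all assumptions made for illustration. The `wanted` mapping stands in for what a .dvc file records (a file name and the hash of its content).

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Mapping


def content_hash(data: bytes) -> str:
    """Key under which an object is stored (SHA-256 chosen only for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def fetch(remote: Path, cache: Path, wanted: Mapping[str, str]) -> None:
    """Copy the wanted objects from the remote into the local cache.

    The workspace is not touched at all.
    """
    cache.mkdir(parents=True, exist_ok=True)
    for digest in wanted.values():
        target = cache / digest
        if not target.exists():  # already cached, nothing to download
            target.write_bytes((remote / digest).read_bytes())


def checkout(cache: Path, workspace: Path, wanted: Mapping[str, str]) -> None:
    """Materialize cached objects as named files in the workspace.

    Raises FileNotFoundError if an object was never fetched.
    """
    workspace.mkdir(parents=True, exist_ok=True)
    for name, digest in wanted.items():
        (workspace / name).write_bytes((cache / digest).read_bytes())


def pull(remote: Path, cache: Path, workspace: Path, wanted: Mapping[str, str]) -> None:
    """pull = fetch + checkout."""
    fetch(remote, cache, wanted)
    checkout(cache, workspace, wanted)


def demo() -> str:
    """Put one object on a fake remote, pull it, and return the workspace file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        remote = root / "remote"
        cache = root / "cache"
        workspace = root / "workspace"
        remote.mkdir()

        content = b"a,b,c\n1,2,3\n"
        digest = content_hash(content)
        (remote / digest).write_bytes(content)

        wanted = {"abc.csv": digest}  # stands in for the contents of data.dvc
        pull(remote, cache, workspace, wanted)
        return (workspace / "abc.csv").read_text()


if __name__ == "__main__":
    print(demo())
```

In this model, calling checkout alone works only when the object is already in the cache, which mirrors why dvc checkout is enough after git checkout if you have fetched the data before, and why dvc pull is needed otherwise.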
It's not a perfect comparison, but you can roughly compare dvc pull with git pull, dvc fetch with git fetch, and dvc checkout with git checkout. They serve a similar purpose, but for large files or directories that you don't want to store in Git directly and instead keep in the cloud, over SSH, on a NAS server, etc.
By the way, besides dvc add you also need to run dvc push to save your data to the remote storage, so that your team (or you on a different machine) can run dvc pull later.
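As a rough summary of the whole round trip, the sketch below builds the command sequences as Python lists. The helper names (commands_to_share, commands_after_switch, run_all) are hypothetical and exist only for this illustration; the git and dvc commands themselves are the ones discussed above. It assumes that dvc add data produced data.dvc and updated .gitignore next to it, as in the project layout from the question.

```python
import subprocess
from typing import List


def commands_to_share(path: str) -> List[List[str]]:
    """Commands to track `path` with DVC and upload it to the remote."""
    return [
        ["dvc", "add", path],
        # The small .dvc file goes into Git; the data itself does not.
        ["git", "add", f"{path}.dvc", ".gitignore"],
        ["git", "commit", "-m", f"Track {path} with DVC"],
        ["dvc", "push"],
    ]


def commands_after_switch(rev: str, data_in_cache: bool) -> List[List[str]]:
    """Commands to bring code and data to the state of Git revision `rev`."""
    steps = [["git", "checkout", rev]]
    if data_in_cache:
        # The cache already holds the data: only update the workspace.
        steps.append(["dvc", "checkout"])
    else:
        # Data is missing locally: download it, then update the workspace.
        steps.append(["dvc", "pull"])
    return steps


def run_all(steps: List[List[str]]) -> None:
    """Run each command in order, stopping at the first failure."""
    for cmd in steps:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try.
    for cmd in commands_after_switch("main", data_in_cache=False):
        print(" ".join(cmd))
```

If you are unsure whether the data is already in your cache, dvc pull is the safe choice, since it covers both steps.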