dvc

Does dvc checkout pull the data, or does it just check out the .dvc files?


I am testing data tracking with DVC (Data Version Control). I went through the example at dvc.org, added the data, and added the generated files to Git as well. Now, after I run

git checkout 

do I run dvc pull or dvc checkout? What do these commands do behind the scenes?

project
  .dvc
  train.py
  data
    abc.csv
  data.dvc

I have initialized a Git repository, installed and initialized DVC, and added some data with the command below:

dvc add data

Solution

  • The dvc pull command is the same as dvc fetch followed by dvc checkout.

    dvc fetch downloads data from the remote storage (S3, Google Cloud, etc.) into the DVC cache, while dvc checkout then "instantiates" those files in the workspace.

    It is not a perfect comparison, but roughly you can compare dvc pull with git pull, dvc fetch with git fetch, and dvc checkout with git checkout. They serve a similar purpose, but for large files or directories that you do not want to save in Git directly and instead keep in the cloud, on an SSH host, a NAS server, etc.

    By the way, besides dvc add you also need to run dvc push to save your data to the remote, so that your team (or you, on a different machine) can run dvc pull later.
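To make the relationship between the commands concrete, here is a minimal sketch of the workflow described above. The function names (publish_data, sync_data_two_steps, sync_data) are hypothetical helpers invented for illustration, and the sketch assumes a default DVC remote is already configured for the project.

```shell
# Sketch only: the function names are hypothetical, and a default DVC
# remote is assumed to be configured already.

# Author side: track the data, commit the small .dvc file to Git,
# then upload the actual data to the remote storage.
publish_data() {
    dvc add data &&
    git add data.dvc .gitignore &&
    git commit -m "Track data with DVC" &&
    dvc push
}

# Consumer side, long form: switch the Git revision, download the
# matching data into the DVC cache, then place it in the workspace.
# $1 is the Git branch, tag, or commit to switch to.
sync_data_two_steps() {
    git checkout "$1" &&
    dvc fetch &&
    dvc checkout
}

# Consumer side, short form: dvc pull does the fetch and the checkout.
sync_data() {
    git checkout "$1" &&
    dvc pull
}
```

For example, sync_data main switches to the main branch and brings the workspace data in line with the data.dvc file at that revision. If the required data is already in the local DVC cache (for instance, on the same machine where dvc add was run), dvc checkout alone should be enough after git checkout; dvc pull is needed when the data first has to be downloaded from the remote.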