I'm trying out DVC (https://dvc.org/). Based on the docs, I followed the sample with the commands below. I created a folder called storage and ran dvc add storage.
Now, if I start adding data to this folder, for example somefile.csv, do I need to run dvc add storage/somefile.csv?
I eventually want to run this on AWS: once I've set up an S3 bucket, I'd push my data to the bucket and run my training job on AWS. I'm also looking at CML (cml.dev), which looks like it would let me do that. Will the CML configuration know to pull from my remote storage?
Also, I'm not too familiar with CML yet. Is it just for running jobs?
I've tried setting things up with the following commands:
- git init
- dvc init
- mkdir storage
#added a storage folder for data then
- dvc add storage
- git commit
#set up the remote storage
- dvc remote add remote_storage s3://somebucket
- dvc push
Some high-level answers for your high-level questions.
DVC is designed to behave much like Git. If you are ever in doubt, run dvc status.
In your example, if you add a file to ./storage, then dvc status will show that the directory has been modified. You can then run dvc commit, which updates your tracking file storage.dvc; that file should then be committed with Git. I would also run dvc install to set up some Git hooks, which will help do this automatically for you.
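A minimal sketch of that workflow as a shell function, under a few assumptions: the helper name update_tracked_data is made up for illustration, the remote is called remote_storage as in the question, and -f is passed to dvc commit so that it does not stop to ask for confirmation.

```shell
# Hypothetical helper: re-record a DVC-tracked directory after files were added to it.
# Usage: update_tracked_data <dir> <commit message> [remote name]
update_tracked_data() {
  dir="$1"
  msg="$2"
  remote="${3:-remote_storage}"   # remote name taken from the question

  # Show what DVC considers changed; <dir> should be reported as modified.
  dvc status

  # Refresh the hashes stored in <dir>.dvc without the interactive prompt.
  dvc commit -f "$dir.dvc" || return 1

  # The small tracking file is what Git versions, not the data itself.
  git add "$dir.dvc" || return 1
  git commit -m "$msg" || return 1

  # Upload the new data to the named remote.
  dvc push -r "$remote"
}
```

After copying somefile.csv into storage, it could be called as update_tracked_data storage "Add somefile.csv". Running dvc add storage again instead of dvc commit should also bring storage.dvc up to date.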
Your .dvc/config should contain the bucket name and config options, so CML or any other machine can run dvc pull to get the data, as long as it has credentials.
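As a rough sketch of what that looks like on a CI machine, assuming the credentials arrive through the standard AWS environment variables (an instance role or an AWS profile would work too, but this sketch does not check for those) and that the remote is named remote_storage as in the question. The helper name pull_tracked_data is hypothetical.

```shell
# Hypothetical helper: fetch DVC-tracked data inside a fresh Git checkout.
# Usage: pull_tracked_data [remote name]
pull_tracked_data() {
  remote="${1:-remote_storage}"   # remote name taken from the question

  # Assumption: credentials are supplied as environment variables (e.g. CI secrets).
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS credentials are not set in the environment" >&2
    return 1
  fi

  # .dvc/config (committed to Git) tells DVC which bucket the remote points to.
  dvc pull -r "$remote"
}
```

Naming the remote with -r is only needed here because the question's dvc remote add did not mark it as the default remote.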
As for CML, I would think of it as a collection of tools that help you interact with GitLab/GitHub; you can follow one of the examples at https://cml.dev. It sounds like you are looking for the cml runner launch command, which (using GitHub as an example) can provision an EC2 instance for you and install the GitHub Actions agent on it to run subsequent CI/CD jobs.
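For illustration, a launch call might look roughly like the sketch below. The region, instance type, and label are placeholder values, the wrapper name launch_training_runner is hypothetical, and CML also expects a repository access token and AWS credentials in the environment, so check the flags against the CML docs for your version.

```shell
# Hypothetical wrapper around the CML runner command.
# Usage: launch_training_runner [region] [instance type]
launch_training_runner() {
  region="${1:-us-west}"        # placeholder region
  instance="${2:-t2.micro}"     # placeholder instance type

  # Provision a cloud instance and register it as a self-hosted CI runner.
  cml runner launch \
    --cloud=aws \
    --cloud-region="$region" \
    --cloud-type="$instance" \
    --labels=cml-runner
}
```

A later CI job that targets the cml-runner label would then run on that instance, where it can call dvc pull before training.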