[SOLVED] How to configure/track data with dvc?

I'm trying out DVC (https://dvc.org/). Following the docs, I ran the sample with the commands below. I created a folder called storage and ran dvc add storage.

Now, if I start adding data to this folder, for example somefile.csv, do I need to run dvc add storage/somefile.csv?

Eventually I want to run this on AWS: once I've set up an S3 bucket, I'd push my data to the bucket and run my training job on AWS. I'm also looking at CML (cml.dev), which looks like it would let me do that. Will the CML configuration know to pull from my remote storage?

Also, I'm not too familiar with CML yet. Is it just for running jobs?

I've tried setting it up with the following commands:

- git init
- dvc init

- mkdir storage 

#added a storage folder for data then 
- dvc add storage

- git commit

#set up the remote storage
- dvc remote add remote_storage s3://somebucket

- dvc push

Solution

Some high-level answers for your high-level questions.

DVC is designed to behave much like Git. If you are ever in doubt, run dvc status.

In your example, you don't need to dvc add each new file separately, because the whole storage folder is already tracked. If you add a file to ./storage, dvc status will show that the folder has been modified. Running dvc commit then updates the tracking file storage.dvc, which you should commit with git. I would also run dvc install, which sets up Git hooks (including a pre-commit hook that runs dvc status) to help keep the two in sync.
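To make that concrete, here is a minimal sketch of the steps after dropping a new file into ./storage. The function name record_storage_changes and the default commit message are made up for illustration. It uses dvc add storage to refresh storage.dvc, since that runs without a confirmation prompt; dvc commit, as mentioned above, is the other way to do it. It also assumes a default remote is already configured so that a plain dvc push knows where to upload.

```shell
# Hypothetical helper: re-snapshot ./storage after files were added or changed.
record_storage_changes() {
    message="${1:-Update data in storage}"

    dvc status || return 1            # list what DVC sees as changed
    dvc add storage || return 1       # re-hash the folder and rewrite storage.dvc
    git add storage.dvc || return 1   # version the small tracking file, not the data
    git commit -m "$message" || return 1
    dvc push                          # upload the new data to the default remote
}
```

Calling record_storage_changes "Add somefile.csv" from the project root would run the five steps in order and stop at the first one that fails.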

Your .dvc/config should contain the bucket URL and any remote config options. As long as that file is committed, CML or any other machine can run dvc pull to get the data, provided it has credentials for the bucket.
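A rough sketch of both sides, reusing the remote name and bucket from the question. The function names configure_remote and fetch_data are hypothetical, and the credentials are assumed to come from the standard AWS environment variables; your setup may provide them differently (for example through an IAM role).

```shell
# Run once in the project: register the bucket as the default remote (-d)
# and commit .dvc/config so every clone knows where the data lives.
configure_remote() {
    dvc remote add -d remote_storage s3://somebucket || return 1
    git add .dvc/config || return 1
    git commit -m "Configure DVC remote"
}

# Run on any other machine (for example a CI job) after cloning the repo.
# Assumes AWS credentials are already available, e.g. via the
# AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables.
fetch_data() {
    dvc pull
}
```

If the remote was already added without -d, dvc remote default remote_storage marks it as the default instead of adding it again.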

As for CML, I would think of it as a collection of tools that help you interact with GitLab/GitHub; you can follow one of the examples at https://cml.dev. It sounds like you are looking for the cml runner launch command, which (using GitHub as an example) can provision an EC2 instance for you and install the GitHub Actions agent on it to run subsequent CI/CD jobs.
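As a rough idea of what that looks like, here is a sketch based on the examples in the CML docs. The wrapper name launch_aws_runner is hypothetical, and the region, instance type, and label are placeholder values to replace with your own. It assumes cml is installed and that REPO_TOKEN (a personal access token for the repository) and your AWS credentials are set in the environment; check the CML docs for the exact options your version supports.

```shell
# Hypothetical wrapper around `cml runner launch`; region, instance type,
# and label below are placeholders, not recommendations.
launch_aws_runner() {
    # cml needs a repository token to register the runner; AWS credentials
    # (e.g. AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) are assumed to be set too.
    [ -n "${REPO_TOKEN:-}" ] || { echo "REPO_TOKEN is not set" >&2; return 1; }

    cml runner launch \
        --cloud=aws \
        --cloud-region=us-west \
        --cloud-type=t2.micro \
        --labels=cml-runner
}
```

A later CI job that targets the cml-runner label would then run on the provisioned instance, where it can check out the repo and call dvc pull before training.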