I have set up my Git repo and initialized DVC. I also added remote storage (an AWS S3 bucket); see the setup below.
dvc remote add -d awsstorage s3://mlops-artifacts
dvc remote add s3cache s3://mlops-artifacts/cache
dvc config cache.s3 s3cache
git add .dvc/config && git commit -am "comment"
git push
Once the setup is complete, I create a new branch, then track my training data via
dvc add --external s3://mlops-artifacts/experiment/one.csv
dvc add --external s3://mlops-artifacts/experiment/two.csv
I run my training in SageMaker and the model file is dumped to, say, s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar.gz. To track it, I add it via DVC:
dvc add --external s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar
git add model.tar.dvc && git commit -am "comment"
git push
I want to know if there is anything else I need to do to track the model output. I want to be able to create other branches, do experiments, and track their model output. If the generated model is dumped to the same S3 path, should adding the model via dvc add --external track each version of the model?
Next, I want to add some metadata to the generated output/model, so I downloaded GTO (https://mlem.ai/doc/gto/user-guide/dvc/) and followed the instructions:
dvc import-url --no-download s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar
git add model.tar.dvc
gto annotate model --path model.tar
git add artifacts.yaml
git commit -m "annotate it with GTO"
dvc push
git push
How can I add version information to my models? The docs talk about setting up a model registry and show the following command to download an artifact. What is $REPO here? How is this all supposed to be set up and work? The docs are not clear on this.
dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH
I want to be able to create other branches and do experiments and track their model output
You can create a DVC pipeline that invokes your training, then run experiments with dvc exp run and turn the ones you want to keep into branches with dvc exp branch:
https://dvc.org/doc/start/experiments/building-pipelines
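For illustration, here is a minimal sketch of that workflow. It assumes a hypothetical training script train.py that reads data/one.csv and writes model.tar; none of these names come from your setup, so substitute whatever actually launches your SageMaker job. The run helper only prints each command unless you set DRY_RUN=0, so you can read through the sequence before pointing it at a real repo.

```shell
#!/bin/sh
# Sketch only: train.py, data/one.csv and model.tar are hypothetical names.
# Each command is printed, not executed, unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Define a single-stage pipeline: dependencies (-d), output (-o), then the command.
run dvc stage add -n train -d train.py -d data/one.csv -o model.tar python train.py

# Run the pipeline as an experiment, then list the recorded experiments.
run dvc exp run
run dvc exp show

# Turn one experiment into a Git branch. EXP_NAME is a placeholder for a
# name reported by `dvc exp show`.
EXP_NAME="${EXP_NAME:-my-experiment}"
run dvc exp branch "$EXP_NAME" "branch-$EXP_NAME"
```

With a pipeline in place, each experiment records its own model output, so you do not have to re-add the model by hand after every run.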
If the generated model is dumped to the same S3 path, should adding the model via dvc add --external track each version of the model?
IIUC, DVC copies the file you track with dvc add --external s3://mlops-artifacts/something to the DVC cache, so you should still be able to access the version you added even after the file is overwritten on S3, as long as you run dvc add again and commit the updated .dvc file for each new version. https://dvc.org/doc/user-guide/data-management/managing-external-data
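A rough sketch of what that could look like in practice, assuming the external cache from your setup is in place. OLD_REV is a placeholder for whichever commit holds the model version you want back, and the run helper prints the commands unless DRY_RUN=0 is set. As far as I understand, dvc checkout restores that version at the external location, so it would replace whatever currently sits at the S3 path; treat this as something to try on a scratch path first.

```shell
#!/bin/sh
# Sketch only: prints commands unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

MODEL_URL="s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar"
OLD_REV="${OLD_REV:-HEAD~1}"   # placeholder: commit with the version you want

# After a training run overwrites the S3 object, record the new version.
run dvc add --external "$MODEL_URL"
run git add model.tar.dvc
run git commit -m "Track new model version"

# Later: take model.tar.dvc from an earlier commit and ask DVC to
# restore the matching data from its cache.
run git checkout "$OLD_REV" -- model.tar.dvc
run dvc checkout model.tar.dvc
```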
Next, I want to add some metadata to the generated output/model, so I downloaded GTO (https://mlem.ai/doc/gto/user-guide/dvc/) and followed the instructions
You don't need to run dvc import-url --no-download s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar because the file is already DVC-tracked after running dvc add --external s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar.
What is $REPO here?
It's an example that uses the shell variable $REPO. Substitute your GitHub repo URL for it (or "." if you have already cd'd into your repo folder).
How can I add version information to my models?
You can use the gto register command to create a semantic version for your model. This creates a Git tag, which you can later reference to get access to the model version you need. https://mlem.ai/doc/gto/get-started/
Note that this Git tag is what you pass as $REVISION in:
dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH
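To make the variables concrete, here is a sketch with illustrative values. The artifact name model and the path model.tar come from your gto annotate command; the version v1.0.0 and the output path downloads/model.tar are assumptions made up for the example. GTO tags have the form name@version, so the revision here is model@v1.0.0. The run helper prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints commands unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

REPO="."                           # or the Git URL of your repo
ARTIFACT_PATH="model.tar"          # path annotated with `gto annotate`
REVISION="model@v1.0.0"            # Git tag created by `gto register` (assumed version)
OUTPUT_PATH="downloads/model.tar"  # hypothetical download location

# Register a version of the artifact at the current commit; this creates the Git tag.
run gto register model HEAD --version v1.0.0

# Publish the tag so other clones of the repo can see it.
run git push --tags

# Download the artifact exactly as it was at that tagged revision.
run dvc get "$REPO" "$ARTIFACT_PATH" --rev "$REVISION" -o "$OUTPUT_PATH"
```

If you leave out --version, GTO picks the next version number for you and reports the tag it created.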