amazon-s3 dvc mlem

How to track model output via DVC and GTO?


I have set up my Git repo and initialized DVC. I also added remote storage (an AWS S3 bucket); see the setup below.

dvc remote add -d awsstorage s3://mlops-artifacts
dvc remote add s3cache s3://mlops-artifacts/cache
dvc config cache.s3 s3cache

git add .dvc/config && git commit -am "comment"
git push

Once the setup is complete, I create a new branch and then track my training data via

dvc add --external s3://mlops-artifacts/experiment/one.csv
dvc add --external s3://mlops-artifacts/experiment/two.csv

I run my training in SageMaker, and the model file is dumped to, say, s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar.gz. To track this, I add it via DVC

dvc add --external s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar
git add model.tar.dvc && git commit -am "comment"
git push

I want to know whether there is anything else I need to do to track the model output. I want to be able to create other branches, run experiments, and track their model output. If the generated model is dumped to the same S3 path, should adding the model via dvc add --external track each version of the model?

Next, I want to add some metadata to the generated output/model, so I downloaded GTO (https://mlem.ai/doc/gto/user-guide/dvc/) and followed the instructions

dvc import-url --no-download s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar
git add model.tar.dvc

gto annotate model --path model.tar

git add artifacts.yaml
git commit -m "annotate it with GTO"

dvc push
git push

How can I add version information to my models? The docs talk about setting up a model registry and show the following command to download an artifact. What is $REPO here? How is this all supposed to be set up and work? The docs are not clear on this.

dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH

Solution

  • I want to be able to create other branches and do experiments and track their model output

    You can create a DVC pipeline that invokes your training, then run experiments and create branches with dvc exp run: https://dvc.org/doc/start/experiments/building-pipelines
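
    As a rough sketch of that workflow, the commands could look like the following. The stage name train, the script train.py, the data directory, and the experiment and branch names are hypothetical placeholders, not taken from the question. The run helper only prints each command by default; set DRY_RUN=0 inside a DVC-initialized repo to actually execute them.

    ```shell
    # Dry-run helper: prints commands by default, executes them when DRY_RUN=0.
    DRY_RUN="${DRY_RUN:-1}"
    run() {
      if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
      else
        "$@"
      fi
    }

    # Define a pipeline stage in dvc.yaml (all names here are hypothetical).
    run dvc stage add -n train -d train.py -d data -o model.tar python train.py

    # Commit the pipeline definition so experiments have a baseline.
    run git add dvc.yaml .gitignore
    run git commit -m "Add training stage"

    # Run the pipeline as a named experiment, then list experiments.
    run dvc exp run -n exp-lr-001
    run dvc exp show

    # Optionally turn the experiment into a regular Git branch.
    run dvc exp branch exp-lr-001 exp-lr-001-branch
    ```

    The stage here assumes train.py is whatever script launches your training job; how it talks to SageMaker is outside this sketch.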

    If the generated model is dumped to the same S3 path, should adding the model via dvc add --external track each version of the model?

    IIUC, DVC copies the file you track with dvc add --external s3://mlops-artifacts/something to the DVC cache, so you should still be able to access the version you ran dvc add on, even after the file is overwritten on S3. https://dvc.org/doc/user-guide/data-management/managing-external-data

    Next, I want to add some metadata to the generated output/model, so I downloaded GTO (https://mlem.ai/doc/gto/user-guide/dvc/) and followed the instructions

    You don't need to run dvc import-url --no-download s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar because it's already DVC-tracked after running dvc add --external s3://mlops-artifacts/output/sagemaker-experiment/output/model.tar.

    What is $REPO here?

    It's an example that uses the shell variable $REPO. Substitute your GitHub repo URL for it (or "." if you have cd'd into your repo folder).

    How can I add version information to my models?

    You can use the gto register command to create a semantic version for your model. This creates a Git tag, which you can later reference to access the model version you need. https://mlem.ai/doc/gto/get-started/

    Note that this Git tag is what you pass as $REVISION in:

    $ dvc get $REPO $ARTIFACT_PATH --rev $REVISION -o $OUTPUT_PATH
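
Putting the last two points together, a minimal sketch might look like this. The artifact name model comes from the gto annotate step in the question; the repository URL and the output path are made-up placeholders, and the tag model@v0.0.1 assumes this is the first version registered for that artifact (check the output of gto register for the tag it actually creates). The run helper only prints the commands unless DRY_RUN=0 is set.

```shell
# Dry-run helper: prints commands by default, executes them when DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

REPO="https://github.com/example-org/example-repo"  # hypothetical repo URL
ARTIFACT_PATH="model.tar"                           # path tracked in the repo
REVISION="model@v0.0.1"                             # assumed tag: <name>@<version>
OUTPUT_PATH="downloads/model.tar"                   # hypothetical local target

# Register a version of the annotated artifact; GTO records it as a Git tag.
run gto register model

# Publish the tag so other clones (and dvc get) can resolve it.
run git push origin "$REVISION"

# Download the model as it was at that registered version.
run dvc get "$REPO" "$ARTIFACT_PATH" --rev "$REVISION" -o "$OUTPUT_PATH"
```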