Based on your experience, can you recommend a convenient experiment tracking and versioning tool specifically for a "multiple independent models, one input -> multiple models -> one output" setup, so that I get a single main evaluation and can conveniently compare the sub-evaluations? See the project example in the diagram.
I know of and have tried W&B, MLflow, DVC, Neptune.ai, DagsHub, and TensorBoard, but only for a single model, and I'm not sure which of them is convenient for multiple independent models. I also couldn't find anything on Google for phrases like "ML experiment tracking and management for multiple models".
Disclaimer: I'm a co-founder at Iterative, and we are the authors of DVC. My response doesn't come from experience with all the tools mentioned above; I took this as an opportunity to try to build a template for this use case in the DVC ecosystem and share it in case it's useful for anyone.
Here is the GitHub repo I've built (note: it's a template, not a real ML project; the scripts are artificially simplified to show the essence of multi-model evaluation):
I've put together an extensive README with a few videos of the CLI, VS Code, and Studio tools.
The core part of the repo is this DVC pipeline, which "trains" multiple models, collects their metrics, and then runs an evaluation stage to "reduce" those metrics into the final one:
```yaml
stages:
  train:
    foreach:
      - model-1
      - model-2
    do:
      cmd: python train.py
      wdir: ${item}
      params:
        - params.yaml:
      deps:
        - train.py
        - data
      outs:
        - model.pkl:
            cache: false
      metrics:
        - ../dvclive/${item}/metrics.json:
            cache: false
      plots:
        - ../dvclive/${item}/plots/metrics/acc.tsv:
            cache: false
            x: step
            y: acc
  evaluate:
    cmd: python evaluate.py
    deps:
      - dvclive
    metrics:
      - evaluation/metrics.json:
          cache: false
```
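For context, here is a minimal, hypothetical sketch of what each model's train.py could look like (the actual scripts in the template may differ). It assumes DVCLive 2.x and derives the model name from the working directory, so the logged files land in ../dvclive/<model-name>/ where the pipeline above expects them:

```python
# train.py -- hypothetical sketch (not the template's actual script), assuming dvclive 2.x.
# The stage runs with `wdir: ${item}`, so the current directory is e.g. model-1/ and the
# DVCLive output directory is ../dvclive/model-1/, matching the paths in the pipeline.
import os
import pickle
import random

from dvclive import Live

model_name = os.path.basename(os.getcwd())  # e.g. "model-1"

with Live(dir=f"../dvclive/{model_name}") as live:
    acc = 0.0
    for step in range(10):
        # Stand-in for a real training step.
        acc = min(1.0, acc + random.uniform(0.0, 0.1))
        live.log_metric("acc", acc)  # appended to plots/metrics/acc.tsv
        live.next_step()             # advances the `step` axis used by the plot

# Placeholder artifact for the `model.pkl` output declared in the stage.
with open("model.pkl", "wb") as f:
    pickle.dump({"final_acc": acc}, f)
```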
The pipeline describes how to build and connect the different parts of the project, and it also makes the project "runnable" and reproducible. It can scale to any number of models (via the foreach clause in the train stage).
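And a similarly simplified, hypothetical sketch of evaluate.py: it collects the per-model metrics.json files written under dvclive/ and "reduces" them into the single main metric in evaluation/metrics.json. The "acc" key and the plain average are assumptions for illustration only:

```python
# evaluate.py -- hypothetical sketch (not the template's actual script).
# Reads every dvclive/<model>/metrics.json produced by the train stages and reduces
# them into one main metric, while keeping the per-model values for comparison.
import json
from pathlib import Path

per_model = {}
for metrics_file in sorted(Path("dvclive").glob("*/metrics.json")):
    model_name = metrics_file.parent.name  # e.g. "model-1"
    per_model[model_name] = json.loads(metrics_file.read_text())["acc"]

summary = {
    "models": per_model,                                   # sub-evaluations
    "acc_mean": sum(per_model.values()) / len(per_model),  # the single main metric
}

out = Path("evaluation/metrics.json")
out.parent.mkdir(parents=True, exist_ok=True)
out.write_text(json.dumps(summary, indent=2))
```

With these in place, `dvc repro` (or `dvc exp run`) executes the whole graph, and `dvc metrics show` / `dvc plots show` can be used to look at the main metric next to the per-model ones.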
Please let me know if that fits your scenario and/or you have more requirements; happy to learn more and iterate on it :)