I am trying to figure out how CatBoost performs multiclass classification with the MultiClass loss function. As I understand it, MultiClass requires M values for each prediction, one for each of the M classes. My questions are:
How are those M values obtained?
How are those M values transformed into predicted probabilities?
My current hypothesis is that CatBoost builds a separate binary classifier for each of the M classes and then uses the softmax function to get the predicted probabilities.
For some other common GBMs, I've seen that they work the way your hypothesis describes: they build one-vs-rest classifiers (completely different from each other in general) and then apply softmax at the end to recover the final predictions.
But apparently CatBoost builds one set of multi-output trees:
https://github.com/catboost/catboost/issues/1806
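
Either way, the last step (turning the M raw values into probabilities) is a softmax. Here is a minimal sketch of just that step in plain Python. The raw scores are made up for illustration, and this is not CatBoost's internal code:

```python
import math

def softmax(raw_values):
    """Map M raw per-class scores to M probabilities that sum to 1."""
    # Subtracting the max keeps exp() from overflowing; the result is unchanged.
    shift = max(raw_values)
    exps = [math.exp(v - shift) for v in raw_values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one object with M = 3 classes
raw = [2.0, 0.5, -1.0]
probs = softmax(raw)

# The predicted class is the one with the highest probability
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

If you want to check this against a trained model, I believe you can get the raw values with `model.predict(X, prediction_type='RawFormulaVal')` and compare their softmax to the output of `model.predict_proba(X)`.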