I would like to use a validation dataset for early stopping while doing multi-label classification, but it seems that sklearn's MultiOutputClassifier doesn't support that. Do you have any suggestions for a solution?
import numpy, sklearn
from sklearn.multioutput import MultiOutputClassifier
from xgboost import XGBClassifier
# Creating some multi-label data
X_train = numpy.array([[1,2,3],[4,5,6],[7,8,9]])
X_valid = numpy.array([[2,3,7],[3,4,9],[7,8,7]])
Y_train = numpy.array([[1,0],[0,1],[1,1]])
Y_valid = numpy.array([[0,1],[1,1],[0,0]])
# Creating a multi-label xgboost
xgb = XGBClassifier(n_estimators=500, random_state=0, learning_rate=0.05, eval_metric='logloss')
xgb_ml = MultiOutputClassifier(xgb)
# Training the model
xgb_ml.fit(X_train, Y_train)
Everything works as expected up to this point!
Now I would like to use a validation set to do some early stopping. I use the same parameters one would use for a normal single-label xgboost.
# Training model using an evaluation dataset
xgb_ml.fit(X_train, Y_train, eval_set=[(X_train, Y_train), (X_valid, Y_valid)], early_stopping_rounds=5)
>ValueError: y should be a 1d array, got an array of shape (3, 2) instead.
It seems that the eval_set parameter does not pick up that the model now needs to be evaluated during training on a multi-label dataset. Is this not supported? Or am I doing something wrong?
@afsharov identified the issue in a comment: sklearn doesn't know anything about the fit_params, it just passes them along to the individual single-output models.
MultiOutputClassifier doesn't do very much, so it wouldn't be a big deal to simply loop through the targets, fit xgboost models, and save them into a list. The main hit would seem to be the loss of parallelization, but you could do that yourself as well.
If you really wanted everything wrapped up in a class, I think deriving from MultiOutputClassifier and overriding the fit method should be enough. You'd copy most of the original fit method (the classes_ attribute setting and most of the parent class _MultiOutputEstimator's fit method), but break the eval_set second elements into their columns and zip them together for the parallel fitting. Something along the lines of changing:
# current code
fit_params_validated = _check_fit_params(X, fit_params)

self.estimators_ = Parallel(n_jobs=self.n_jobs)(
    delayed(_fit_estimator)(
        self.estimator, X, y[:, i], sample_weight,
        **fit_params_validated)
    for i in range(y.shape[1]))
(source) to
fit_params_validated = _check_fit_params(X, fit_params)
eval_set = fit_params_validated.pop("eval_set", [(X, y)])

self.estimators_ = Parallel(n_jobs=self.n_jobs)(
    delayed(_fit_estimator)(
        self.estimator, X, y[:, i], sample_weight,
        # slice the i-th target column out of every (X_e, y_e) eval pair,
        # so each single-output model still receives a list of eval sets
        eval_set=[(X_e, y_e[:, i]) for X_e, y_e in eval_set],
        **fit_params_validated)
    for i in range(y.shape[1]))