python, scikit-learn, xgboost, multiclass-classification, overfitting-underfitting

How to avoid overfitting on multiclass classification OvR Xgboost model / class_weight in Python?


I am trying to build a multiclass classification model in Python using XGBoost with OvR (OneVsRestClassifier), like below:

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(abt.drop("TARGET", axis=1),
                                                    abt["TARGET"],
                                                    train_size=0.70,
                                                    test_size=0.30,
                                                    random_state=123,
                                                    stratify=abt["TARGET"])

model_1 = OneVsRestClassifier(XGBClassifier())

With the code above I get HUGE overfitting: AUC_TRAIN: 0.9988, AUC_TEST: 0.7650.

So, I decided to use class_weight.compute_class_weight:

import numpy as np
from sklearn.utils import class_weight

class_weights = class_weight.compute_class_weight(class_weight='balanced',
                                                  classes=np.unique(y_train),
                                                  y=y_train)

model_1.fit(X_train, y_train, class_weight=class_weights)

roc_auc_score(y_train, model_1.predict_proba(X_train), multi_class='ovr')

roc_auc_score(y_test, model_1.predict_proba(X_test), multi_class='ovr')

Nevertheless, when I try to use class_weight.compute_class_weight as above, I get the following error: TypeError: fit() got an unexpected keyword argument 'class_weight'

How can I fix that? Or do you have another idea how to avoid such HUGE overfitting in my multiclass classification model in Python?
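(Editorial side note, not from the original post: the standard scikit-learn route for turning class balance into something an estimator's fit() will accept is compute_sample_weight, which expands class weights into one weight per training sample; XGBClassifier.fit, used without the OvR wrapper, accepts these through its sample_weight argument. A minimal sketch with toy labels standing in for abt["TARGET"]:)

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels standing in for abt["TARGET"] (hypothetical data).
y_train = np.array([0, 0, 0, 1, 1, 2])

# 'balanced' gives each sample the weight n_samples / (n_classes * count(its class)):
# class 0 -> 6/(3*3) ~ 0.667, class 1 -> 6/(3*2) = 1.0, class 2 -> 6/(3*1) = 2.0
sample_weights = compute_sample_weight(class_weight="balanced", y=y_train)

# These would then be passed to the estimator, e.g. (assuming X_train exists):
# XGBClassifier().fit(X_train, y_train, sample_weight=sample_weights)
```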


Solution

  • The issue in your case seems to be that OneVsRestClassifier's fit() does not accept a class_weight argument to pass on to the base estimator (see the doc).

    A way around this would be to set the scale_pos_weight parameter (a float) directly in the XGBClassifier definition, which adjusts the weight of the positive class relative to the negative class in each binary sub-problem:

    model_1 = OneVsRestClassifier(XGBClassifier(scale_pos_weight=1))
    

    Note that 1 is the default value (no rebalancing); for imbalanced data a common choice is the ratio sum(negative instances) / sum(positive instances).

    scale_pos_weight (Optional[float]) – Balancing of positive and negative weights.


    See also the doc: https://xgboost.readthedocs.io/en/stable/python/python_api.html
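Since scale_pos_weight is a single float, each binary sub-problem of the OvR scheme would ideally get its own negatives-to-positives ratio. OneVsRestClassifier clones one shared estimator, so it cannot set a different value per class out of the box, but the per-class ratios themselves are easy to compute. A minimal sketch with toy labels standing in for y_train (names are illustrative, not from the original post):

```python
import numpy as np

# Toy labels standing in for y_train (hypothetical data).
y_train = np.array([0, 0, 0, 1, 1, 2])

classes, counts = np.unique(y_train, return_counts=True)
n = len(y_train)

# For each one-vs-rest sub-problem: scale_pos_weight = n_negative / n_positive,
# e.g. class 0 -> (6-3)/3 = 1.0, class 1 -> (6-2)/2 = 2.0, class 2 -> (6-1)/1 = 5.0
ratios = {int(c): float(n - k) / float(k) for c, k in zip(classes, counts)}
```

One way to actually apply a different ratio per class is to train the K binary models in a manual loop instead of OneVsRestClassifier, giving each XGBClassifier its own scale_pos_weight.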