python scikit-learn randomized-algorithm make-scorer

Matthews Correlation Coefficient and Precision throw errors in RandomizedSearchCV


I keep getting this error:

invalid value encountered in double_scalars: mcc = cov_ytyp / np.sqrt(cov_ytyt * cov_ypyp)

Is there something wrong with how I implemented it as a custom scorer?

import numpy as np
from sklearn.metrics import make_scorer, matthews_corrcoef
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

# Search space for the XGBoost hyperparameters
parameters_XG = {'n_estimators': np.arange(50, 500, 50),
                 'learning_rate': np.arange(0.1, 1.05, .05),
                 'colsample_bytree': np.arange(0.1, 1.05, .05),
                 'subsample': np.arange(0.5, 1.05, .05),  # XGBoost spells this 'subsample', not 'sub_sample'
                 'min_child_weight': np.arange(1, 10),
                 'gamma': np.arange(0.1, 5, 0.2),
                 'max_depth': np.arange(1, 15),
                 'scale_pos_weight': np.arange(0.1, 1.0, .05)}

XG_model = XGBClassifier(booster='gbtree', random_state=2504, n_jobs=-1)

# Several metrics are tracked; MCC is wrapped with make_scorer
multi_score = {'neg_log_loss': 'neg_log_loss',
               'precision': 'precision',
               'recall': 'recall',
               'F1_weighted': 'f1_weighted',
               'ROC_AUC': 'roc_auc',
               'Brier_score': 'brier_score_loss',
               'MCC': make_scorer(matthews_corrcoef)}

# cv_RSKFCV, X_train and y_train are defined earlier (not shown)
search_XG = RandomizedSearchCV(XG_model, parameters_XG, scoring=multi_score,
                               n_jobs=-1, cv=cv_RSKFCV, n_iter=200, refit='neg_log_loss',
                               random_state=2504).fit(X_train, y_train)

EDIT: I understand why it throws the warnings/errors. What I don't understand is why it now won't fit at all. I would expect many of the scores to simply be inf/nan, but instead it throws a traceback pointing at random_state = 2504).fit(X_train, y_train). How can I resolve this?


Solution

  • When computing the Matthews correlation coefficient you divide one value by another. The message appears when the denominator is 0, which makes the result undefined.

    This most likely happens because the model always predicts a single class. For example, if it never predicts the positive class, TP and FP are both 0, so the denominator is 0 too. To fix it, adjust the parameter ranges you search over so that you avoid these degenerate models that only predict one class.

    You can also leave the Matthews correlation out of RandomizedSearchCV and compute it only for the final model. The cost is that you will not have this score for each iteration of the search.
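The zero-denominator case described above can be reproduced on a toy example. The following is a minimal sketch, not the asker's setup: safe_mcc is a hypothetical helper name, the labels are assumed to be binary and encoded as 0 and 1, and the small arrays are made up purely for illustration. It returns 0.0 whenever the denominator vanishes instead of dividing by zero.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer, matthews_corrcoef


def safe_mcc(y_true, y_pred):
    """Binary MCC that returns 0.0 when the denominator is zero.

    Hypothetical helper; assumes the two classes are labelled 0 and 1.
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Denominator of MCC: sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = np.sqrt(float(tp + fp) * float(tp + fn) * float(tn + fp) * float(tn + fn))
    if denom == 0:
        # Happens when the model predicts only one class (or y_true has only one class)
        return 0.0
    return float(tp * tn - fp * fn) / denom


# Made-up toy labels for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
constant_pred = np.zeros_like(y_true)      # a model that only ever predicts class 0
mixed_pred = np.array([0, 1, 1, 1, 0, 0])  # a model that predicts both classes

print(safe_mcc(y_true, constant_pred))     # denominator is zero, so the guard kicks in
print(safe_mcc(y_true, mixed_pred), matthews_corrcoef(y_true, mixed_pred))

# Scorer object that could stand in for make_scorer(matthews_corrcoef)
mcc_scorer = make_scorer(safe_mcc)
```

If this behaviour is acceptable, mcc_scorer could be used as the 'MCC' entry of the scoring dictionary. Treating an undefined MCC as 0 is a design choice: it ranks single-class models the same as random guessing rather than letting them produce nan scores.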