Tags: python, scikit-learn, xgboost, gridsearchcv

Discrepancy between best_score_ and highest mean_test_score in cv_results_ of HalvingGridSearchCV


Problem

I'm using HalvingGridSearchCV in scikit-learn for hyperparameter tuning of an XGBoost model inside an imbalanced-learn pipeline. I've noticed that best_score_ (and consequently best_params_) does not correspond to the highest mean_test_score in cv_results_. This discrepancy puzzles me, especially since it can be substantial in some cases: I initially expected best_score_ to match the highest mean_test_score in cv_results_, as that score typically represents the best-performing model.

Is it reasonable to consider the model with the highest mean_test_score in cv_results_ as a valid choice for deployment instead of best_estimator_? What are the trade-offs to consider?

As I can't provide the full pipeline or data, I'm sharing the relevant parts of the code:

from imblearn.ensemble import BalancedBaggingClassifier
from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401, required for HalvingGridSearchCV
from sklearn.model_selection import TimeSeriesSplit, HalvingGridSearchCV

# data
X, y = ...

model = Pipeline([
        ('sampling', FunctionSampler(
            ...
        )),
        ('classification', BalancedBaggingClassifier(
            base_estimator=XGBClassifier(
                            eval_metric='aucpr', use_label_encoder=False)
         ))
    ])

params = {
            'classification__base_estimator__max_depth': [3, 5, 7, 10],
            'classification__base_estimator__gamma': [0., 1e-4, 1e-2, 0.1, 1.]
        }

cv = TimeSeriesSplit(n_splits=5)
clf = HalvingGridSearchCV(
            estimator=model,
            param_grid=params,
            scoring='average_precision',
            factor=3,
            min_resources=2500,
            cv=cv,
            verbose=1,
            refit=True,
        )
clf.fit(X, y)

Results:

The parameter combination corresponding to the supposedly best score (best_params_) sometimes doesn't even appear among the top 5 highest mean_test_score values in cv_results_.
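
For reference, a minimal way to surface the discrepancy (assuming clf has been fitted as above) is to load cv_results_ into a pandas DataFrame and compare its top rows with the reported best score:

import pandas as pd

# cv_results_ from HalvingGridSearchCV also contains 'iter' and
# 'n_resources' columns, one row per (candidate, iteration) pair.
results = pd.DataFrame(clf.cv_results_)
top5 = results.nlargest(5, 'mean_test_score')[
    ['iter', 'n_resources', 'mean_test_score', 'params']]
print(top5)

print('best_score_ :', clf.best_score_)
print('best_params_:', clf.best_params_)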

To sum up: why does best_score_ not correspond to the highest mean_test_score in cv_results_, and is it reasonable to deploy the model with the highest mean_test_score instead of best_estimator_?

Any insights would be greatly appreciated. While I can't share the actual data, I can provide additional code snippets or details for context. Thank you!


Solution

  • The best_score_ and the associated best_params_ always correspond to the last iteration, i.e. the one run with the maximum amount of resources. In the successive-halving example in the scikit-learn User Guide, the same thing happens: earlier iterations actually have higher mean test scores, but those are not taken into consideration when selecting the winner.

    Broadly, you do expect the later iterations to perform better because they have more resources. When the resource is the number of samples (the default), the test folds are also subsampled, which can lead to this situation: the test scores of the earlier, smaller iterations are noisier, so just by chance they are sometimes higher. So I would be hesitant to select a different set of parameters based on earlier iterations.
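
    As a quick sanity check (a sketch, assuming clf has been fitted as in the question), you can verify that the winner is selected only among the candidates of the final iteration:

    import numpy as np
    import pandas as pd

    results = pd.DataFrame(clf.cv_results_)

    # Candidates that survived to the final (maximum-resource) iteration
    final = results[results['iter'] == results['iter'].max()]

    # The refit winner is always drawn from this subset, and best_score_
    # is the maximum mean_test_score within it -- not across all iterations.
    assert clf.best_index_ in final.index
    assert np.isclose(clf.best_score_, final['mean_test_score'].max())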