I'm using BayesSearchCV from scikit-optimize to optimise an XGBoost model to fit some data I have. While the model fits fine, I am puzzled by the scores provided in the diagnostic information and am unable to replicate them.
Here's an example script using the Boston house prices dataset to illustrate my point:
from sklearn.datasets import load_boston
import numpy as np
import pandas as pd
from xgboost.sklearn import XGBRegressor
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import KFold, train_test_split
boston = load_boston()
# Dataset info:
print(boston.keys())
print(boston.data.shape)
print(boston.feature_names)
print(boston.DESCR)
# Put data into dataframe and label column headers:
data = pd.DataFrame(boston.data)
data.columns = boston.feature_names
# Add target variable to dataframe
data['PRICE'] = boston.target
# Split into X and y
X, y = data.iloc[:, :-1], data.iloc[:, -1]
# Split into training and validation datasets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2,
                                                  random_state=42, shuffle=True)
# For cross-validation, split training data into 5 folds
# (shuffle=True is required for random_state to take effect in newer sklearn)
xgb_kfold = KFold(n_splits=5, shuffle=True, random_state=42)
# Run fit
xgb_params = {'n_estimators': Integer(10, 3000, 'uniform'),
              'max_depth': Integer(2, 100, 'uniform'),
              'subsample': Real(0.25, 1.0, 'uniform'),
              'learning_rate': Real(0.0001, 0.5, 'uniform'),
              'gamma': Real(0.0001, 1.0, 'uniform'),
              'colsample_bytree': Real(0.0001, 1.0, 'uniform'),
              'colsample_bylevel': Real(0.0001, 1.0, 'uniform'),
              'colsample_bynode': Real(0.0001, 1.0, 'uniform'),
              'min_child_weight': Real(1, 6, 'uniform')}
xgb_fit_params = {'early_stopping_rounds': 15, 'eval_metric': 'mae',
                  'eval_set': [[X_val, y_val]]}
xgb_pipe = XGBRegressor(random_state=42, objective='reg:squarederror', n_jobs=10)
xgb_cv = BayesSearchCV(xgb_pipe, xgb_params, cv=xgb_kfold, n_iter=5, n_jobs=1,
                       random_state=42, verbose=4, scoring=None,
                       fit_params=xgb_fit_params)
xgb_cv.fit(X_train, y_train)
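(Side note: I find the diagnostics easier to browse as a DataFrame; this is just a convenience and doesn't change anything:)
# Optional: view the per-candidate search diagnostics as a table
results = pd.DataFrame(xgb_cv.cv_results_)
print(results[['mean_test_score', 'std_test_score', 'rank_test_score']])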
After running this, xgb_cv.best_score_ is 0.816 and xgb_cv.best_index_ is 3. Looking at xgb_cv.cv_results_, I want to find the per-fold test scores for the best candidate:
for i in range(5):
    print(xgb_cv.cv_results_[f'split{i}_test_score'][xgb_cv.best_index_])
Which gives:
0.8023562337946979
0.8337404778903412
0.861370681263761
0.8749312273014963
0.7058815015739375
I'm not sure what's being calculated here, because scoring is set to None in my code. XGBoost's documentation isn't much help, but according to xgb_cv.best_estimator_.score? it's supposed to be the R² of the predicted values. In any case, I'm unable to obtain these values when I manually calculate the score for each fold of the data used in the fit:
# First, get the actual row indices of the data in each fold:
kfold_indexes = {}
for kfold_cnt, (train_index, test_index) in enumerate(xgb_kfold.split(X_train)):
    kfold_indexes[kfold_cnt] = {'train': train_index, 'test': test_index}
# Next, calculate the score for each fold's test set
for p in range(5):
    print(xgb_cv.best_estimator_.score(X_train.iloc[kfold_indexes[p]['test']],
                                       y_train.iloc[kfold_indexes[p]['test']]))
Which gives me the following:
0.9954929618573786
0.994844803666101
0.9963108152027245
0.9962274544089832
0.9931314653538819
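(As a sanity check that the default score really is R², the first of these numbers can be reproduced with sklearn.metrics.r2_score, which isn't part of my original script:)
from sklearn.metrics import r2_score
# Manually compute R² for fold 0's test set using the refitted best estimator
fold0_test = kfold_indexes[0]['test']
preds = xgb_cv.best_estimator_.predict(X_train.iloc[fold0_test])
print(r2_score(y_train.iloc[fold0_test], preds))  # matches the first value above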
How is BayesSearchCV calculating the scores for each fold, and why can't I replicate them using the score function? I would be most grateful for any assistance with this issue.
(Also, manually calculating the mean of these scores gives 0.8156560..., while xgb_cv.best_score_ gives 0.8159277... Not sure why there's a precision difference here.)
best_estimator_ is the refitted estimator, fitted on the entire training set after choosing the hyperparameters; so scoring it on any portion of the training set will be optimistically biased. To reproduce cv_results_, you would need to refit an estimator to each training fold and score it on the corresponding test fold.
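Something along these lines (a sketch, reusing the question's xgb_kfold and xgb_fit_params) should come much closer to the cv_results_ numbers:
from sklearn.base import clone

for train_idx, test_idx in xgb_kfold.split(X_train):
    # A fresh, unfitted copy carrying the winning hyperparameters
    fold_est = clone(xgb_cv.best_estimator_)
    fold_est.fit(X_train.iloc[train_idx], y_train.iloc[train_idx], **xgb_fit_params)
    # Score only the held-out fold, as BayesSearchCV does
    print(fold_est.score(X_train.iloc[test_idx], y_train.iloc[test_idx]))
Even then, the numbers may not match exactly unless XGBoost's own randomness is pinned down, which brings us to the next point.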
Beyond that, there does appear to be more randomness not covered by the XGBoost random_state. There is another parameter, seed; setting that produces consistent results for me. (There are some older posts here (example) reporting similar issues even with seed set, but perhaps those have been resolved by newer versions of xgb.)
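Concretely, that just means passing XGBoost's native seed alongside the sklearn-style random_state (depending on your XGBoost version, seed may simply be treated as an alias of random_state):
# Pin XGBoost's own seed in addition to random_state
xgb_pipe = XGBRegressor(random_state=42, seed=42,
                        objective='reg:squarederror', n_jobs=10)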