python machine-learning scikit-learn xgboost

GridSearchCV not choosing the best hyperparameters for xgboost


I am developing a regression model with xgboost. Since xgboost has multiple hyperparameters, I have added cross-validation with GridSearchCV(). As a trial, I set max_depth: [2,3]. My Python code is below.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error

xgb_reg = xgb.XGBRegressor()

# Obtain the best hyperparameters
# greater_is_better=False negates the MSE so that GridSearchCV can maximize it
scorer = make_scorer(mean_squared_error, greater_is_better=False)
params = {'max_depth': [2,3], 
          'eta': [0.1], 
          'colsample_bytree': [1.0],
          'colsample_bylevel': [0.3],
          'subsample': [0.9],
          'gamma': [0],
          'lambda': [1],
          'alpha':[0],
          'min_child_weight':[1]
         }
grid_xgb_reg=GridSearchCV(xgb_reg,
                          param_grid=params,
                          scoring=scorer,
                          cv=5,
                          n_jobs=-1)

grid_xgb_reg.fit(X_train, y_train)
y_pred = grid_xgb_reg.predict(X_test)
y_train_pred = grid_xgb_reg.predict(X_train)

## Evaluate model
import numpy as np
from sklearn.metrics import r2_score

print('RMSE  train: %.3f,  test: %.3f' %(np.sqrt(mean_squared_error(y_train, y_train_pred)),np.sqrt(mean_squared_error(y_test, y_pred))))
print('R^2   train: %.3f,  test: %.3f' %(r2_score(y_train, y_train_pred),r2_score(y_test, y_pred)))

The problem is that GridSearchCV does not seem to choose the best hyperparameters. When I set max_depth to [2,3], GridSearchCV chose max_depth:2 as the best hyperparameter, and the result is as follows.

#  The result when max_depth is 2
RMSE  train: 11.861,  test: 15.113
R^2   train: 0.817,  test: 0.601

However, if I change max_depth to [3] (by getting rid of 2), the test score is better than the previous value, as follows.

#  The result when max_depth is 3
RMSE  train: 9.951,  test: 14.752
R^2   train: 0.871,  test: 0.620

Question

My understanding is that even if I set max_depth to [2,3], GridSearchCV SHOULD choose max_depth:3 as the best hyperparameter, since max_depth:3 returns a better score in terms of RMSE or R^2 than max_depth:2. Could anyone tell me why my code does not choose the best hyperparameters when I set max_depth to [2,3]?


Solution

  • If you run a second experiment with max_depth:2, then the results are not comparable to those of the first experiment with max_depth:[2,3], even for the run with max_depth:2, because your code contains sources of randomness that you do not explicitly control; in other words, your code is not reproducible.

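    The first potential source of randomness is the CV folds. With an integer cv, scikit-learn builds an unshuffled KFold for a regressor, so the splits depend only on the row order; the seed starts to matter once the folds are shuffled. To state explicitly that every experiment runs on identical splits of the data, define your GridSearchCV as follows (note that random_state requires shuffle=True; recent scikit-learn versions raise an error otherwise):

    from sklearn.model_selection import KFold
    
    seed_cv = 123 # any fixed integer here
    
    # shuffle=True is needed for random_state to have any effect
    kf = KFold(n_splits=5, shuffle=True, random_state=seed_cv)
    
    grid_xgb_reg=GridSearchCV(xgb_reg,
                              param_grid=params,
                              scoring=scorer,
                              cv=kf,   # <- change here
                              n_jobs=-1)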
    

    The second source of randomness is the XGBRegressor itself, which samples rows and columns at random here (subsample=0.9, colsample_bylevel=0.3) and also accepts a random_state argument (see the docs); you should change it to:

    seed_xgb = 456 # any fixed integer here (can even be the same as seed_cv)
    xgb_reg = xgb.XGBRegressor(random_state=seed_xgb)
    

    But even with these arrangements, while your data splits will now be identical, the regression models built will not necessarily be so in the general case. If you keep the experiments as described, i.e. first with max_depth:[2,3] and then with max_depth:2, the results will indeed be identical. But if you change them to, say, first max_depth:[2,3] and then max_depth:3, they may not be, since in the first experiment the run with max_depth:3 can start from a different state of the random number generator (i.e. the state left after the run with max_depth:2 has finished).

    There are limits to how identical you can make different runs under such conditions; for an example of a very subtle difference that nevertheless destroys the exact reproducibility between two experiments, see my answer in "Why does the importance parameter influence performance of Random Forest in R?"
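
To check reproducibility empirically, and to see what GridSearchCV actually compared, something like the following minimal sketch may help. It rests on several assumptions: the data are synthetic (make_regression stands in for your X_train, y_train), scikit-learn's GradientBoostingRegressor is used as a stand-in for xgb.XGBRegressor so that the snippet needs only scikit-learn, and run_search is a hypothetical helper name. Note that the winner is picked by the mean cross-validated score on folds of the training data (cv_results_['mean_test_score'] refers to the validation folds), so a candidate that scores better on your separate X_test is not necessarily the one selected.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic data (assumption: stands in for the asker's own data set).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


def run_search(depths, seed_cv=123, seed_model=456):
    """Run one grid search with every seed pinned and return the fitted search."""
    # Stand-in for xgb.XGBRegressor(random_state=seed_model); subsample < 1
    # makes each fit stochastic, so the seed matters.
    model = GradientBoostingRegressor(
        n_estimators=50, subsample=0.9, random_state=seed_model
    )
    # shuffle=True is required for random_state to have any effect in KFold.
    kf = KFold(n_splits=5, shuffle=True, random_state=seed_cv)
    search = GridSearchCV(
        model,
        param_grid={"max_depth": list(depths)},
        scoring="neg_mean_squared_error",
        cv=kf,
        n_jobs=1,  # a single process keeps the sketch simple
    )
    return search.fit(X_train, y_train)


first = run_search([2, 3])
second = run_search([2, 3])

# Same seeds and same grid: the cross-validated scores should match.
scores_first = first.cv_results_["mean_test_score"]
scores_second = second.cv_results_["mean_test_score"]
same_scores = np.allclose(scores_first, scores_second)

# The selected candidate is the one with the highest mean CV score;
# X_test never enters this decision.
for params, score in zip(first.cv_results_["params"], scores_first):
    print(params, "RMSE from mean CV MSE: %.3f" % np.sqrt(-score))
print("selected:", first.best_params_, "reproducible:", same_scores)
```

Printing the per-candidate CV scores like this for your own grid should show whether max_depth:2 really had the better cross-validated score, which would explain why it was chosen despite the lower test score.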