xgboost

How does XGBoost calculate base_score?


Since XGBoost 2.0, base_score is automatically calculated if it is not specified when initialising an estimator. I naively thought it would simply use the mean of the target, but this does not seem to be the case:

import json
import shap # only for the dataset
import xgboost as xgb

print('shap.__version__:',shap.__version__)
print('xgb.__version__:',xgb.__version__)
print()

X, y = shap.datasets.adult()

estimator = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=200)

estimator.fit(X,y)

print('y.mean():',y.mean())
print("float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']):",float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']))

Output:

shap.__version__: 0.46.0
xgb.__version__: 2.1.0

y.mean(): 0.2408095574460244
float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']): 0.26177529

The difference is too big for a rounding error. So how is base_score calculated? I think this is the relevant commit, but it's hard to tell exactly what it does.


Solution

  • TL;DR

    Before XGBoost 2.0, base_score simply defaulted to 0.5 when it was not specified (regardless of the objective), and training started from that constant.

    Since XGBoost 2.0, this initialization is still applied at first. However, before training the first tree, XGBoost internally fits a one-node tree (a stump) and uses that node's optimal output weight as the updated base_score.

    This logic can be found in the source code here.
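
    As the question notes, this estimation only happens when base_score is left unspecified. A quick way to see that (a minimal sketch reusing the question's setup; the printed value should simply be the one you passed in):

    import json
    import shap  # only for the dataset
    import xgboost as xgb

    X, y = shap.datasets.adult()

    # passing base_score explicitly should skip the automatic stump-based estimation
    estimator = xgb.XGBClassifier(objective='binary:logistic', n_estimators=1, base_score=0.5)
    estimator.fit(X, y)

    config = json.loads(estimator.get_booster().save_config())
    print(float(config['learner']['learner_model_param']['base_score']))  # expected: 0.5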

    Simple Derivation

    Classification (binary:logistic)

    Assume: the training set contains n samples with label 1 and m samples with label 0.

    Initial prediction for every sample is: a raw score of 0, i.e. a probability of p_i = sigmoid(0) = 0.5.

    Then the gradients and Hessians for each sample are: g_i = p_i - y_i = 0.5 - y_i and h_i = p_i * (1 - p_i) = 0.25.

    Total gradient and Hessian: G = sum(g_i) = (m - n) / 2 and H = sum(h_i) = (m + n) / 4.

    Optimal constant prediction (i.e., the stump leaf value) is: w* = -G / H = -2 * (m - n) / (m + n).

    This raw value is then passed through the sigmoid function: base_score = sigmoid(w*) = 1 / (1 + exp(-w*)).

    So the new base_score estimated by XGBoost is the probability corresponding to that optimal constant.

    import numpy as np

    def cal_base_score(train_label):
        n = sum(train_label)                # number of samples with label 1
        m = len(train_label) - n            # number of samples with label 0
        w = -2 * (m - n) / (m + n)          # optimal stump weight w* = -G / H
        base_score = 1 / (1 + np.exp(-w))   # map the raw score back to a probability
        return base_score
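
    As a sanity check, plugging the label mean reported in the question into the same formula (written in terms of the positive rate p = n / (m + n), so the exact counts are not needed) reproduces the stored value:

    import numpy as np

    p = 0.2408095574460244        # y.mean() from the question
    w = -2 * ((1 - p) - p)        # equals -2 * (m - n) / (m + n)
    print(1 / (1 + np.exp(-w)))   # ~0.2618, matching the reported base_score of 0.26177529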
    

    Regression (reg:squarederror)

    Initial prediction for all samples: y_hat_i = 0 (the stump is again fitted on gradients evaluated at a zero prediction).

    Then for each sample: g_i = y_hat_i - y_i = -y_i and h_i = 1.

    Total gradient and Hessian: G = sum(g_i) = -sum(y_i) and H = sum(h_i) = N (the number of samples).

    Optimal constant prediction: w* = -G / H = sum(y_i) / N.

    So the new base_score used by XGBoost is simply the mean of the labels: base_score_new = w* = mean(y).
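
    This can be checked the same way the question inspects the classifier, here on synthetic data made up purely for illustration:

    import json
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = rng.normal(loc=3.0, size=1000)

    reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=1)
    reg.fit(X, y)

    config = json.loads(reg.get_booster().save_config())
    print('y.mean():', y.mean())
    print('base_score:', float(config['learner']['learner_model_param']['base_score']))
    # the two printed values should agree up to float32 precision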