Since XGBoost 2.0, base_score is automatically calculated if it is not specified when initialising an estimator. I naively thought it would simply use the mean of the target, but this does not seem to be the case:
import json

import shap  # only for the dataset
import xgboost as xgb

print('shap.__version__:', shap.__version__)
print('xgb.__version__:', xgb.__version__)
print()

X, y = shap.datasets.adult()
estimator = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=200)
estimator.fit(X, y)

print('y.mean():', y.mean())
print("float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']):",float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']))
Output:
shap.__version__: 0.46.0
xgb.__version__: 2.1.0
y.mean(): 0.2408095574460244
float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']): 0.26177529
The difference is too large to be a rounding error. So how is base_score calculated? I think this is the relevant commit, but it's hard to tell exactly what it does.
Before XGBoost 2.0, for classification objectives (e.g. binary:logistic) base_score was initialized to 0.5 (i.e., a probability; the corresponding raw score is logit(0.5) = 0), and for regression objectives (e.g. reg:squarederror) it was initialized to 0.

Since XGBoost 2.0, these initializations remain unchanged at first. However, before training the first tree, XGBoost internally fits a one-node tree (a stump) and uses the optimal output weight of that node as the updated base_score.
This logic can be found in the source code here.
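In other words, the automatic base_score is the optimal constant prediction for the chosen loss, i.e. the weight of a single leaf that contains every training sample. A minimal sketch of that idea (regularisation ignored, as in the derivations below; the helper name is mine, not XGBoost's):

import numpy as np

def stump_weight(grad, hess):
    # Second-order optimal weight of a single leaf holding all samples:
    # w* = -sum(g_i) / sum(h_i)
    return -np.sum(grad) / np.sum(hess)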
For binary classification (binary:logistic), assume the initial prediction for every sample is the probability p_i = 0.5 (raw score 0).

Then the gradients and Hessians of the log loss for each sample are:

g_i = p_i - y_i = 0.5 - y_i
h_i = p_i * (1 - p_i) = 0.25

With n positive labels and m negative labels, the total gradient and Hessian are:

G = sum(g_i) = 0.5 * (m + n) - n = (m - n) / 2
H = sum(h_i) = 0.25 * (m + n)

The optimal constant prediction (i.e., the stump leaf value) is:

w* = -G / H = -2 * (m - n) / (m + n)

This raw value is then passed through the sigmoid function:

base_score = sigmoid(w*) = 1 / (1 + exp(-w*))

So the new base_score estimated by XGBoost is the probability corresponding to that optimal constant.
import numpy as np

def cal_base_score(train_label):
    # Reproduce the automatically calculated base_score for binary:logistic.
    n = sum(train_label)               # number of 1s (positives)
    m = len(train_label) - n           # number of 0s (negatives)
    w = -2 * (m - n) / (m + n)         # optimal raw leaf weight of the stump
    base_score = 1 / (1 + np.exp(-w))  # convert raw score to a probability
    return base_score
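As a quick sanity check against the numbers in the question (this assumes the same shap Adult dataset and the cal_base_score function above):

import shap  # same dataset as in the question

X, y = shap.datasets.adult()
print('y.mean():', y.mean())                     # 0.2408..., as in the question
print('cal_base_score(y):', cal_base_score(y))   # ~0.2618, matching the saved base_score of 0.26177529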
For regression (reg:squarederror), assume the initial prediction for all samples is 0.

Then for each sample, the gradient and Hessian of the squared error are:

g_i = prediction_i - y_i = -y_i
h_i = 1

Total gradient and Hessian over N samples:

G = sum(g_i) = -sum(y_i)
H = sum(h_i) = N

Optimal constant prediction:

w* = -G / H = sum(y_i) / N

So the new base_score used by XGBoost is simply the mean of the labels:

base_score_new = w* = mean(y)
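A quick way to check the regression case is the same config-reading trick as in the question (synthetic data here, purely illustrative; if the derivation above is right, the two printed values should agree up to float precision on XGBoost 2.x):

import json
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(loc=3.0, size=1000)

reg = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=10)
reg.fit(X, y)

config = json.loads(reg.get_booster().save_config())
print('y.mean():', y.mean())
print('base_score:', float(config['learner']['learner_model_param']['base_score']))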