How do we make sense of SHAP's explainer.expected_value? Why is it not the same as y_train.mean() after the sigmoid transformation?
Below is a summary of the code for quick reference. The full code is available in this notebook: https://github.com/MenaWANG/ML_toy_examples/blob/main/explain%20models/shap_XGB_classification.ipynb
import numpy as np
import pandas as pd
import shap
import xgboost as xgb

model = xgb.XGBClassifier()
model.fit(X_train, y_train)
explainer = shap.Explainer(model)
shap_test = explainer(X_test)
shap_df = pd.DataFrame(shap_test.values)
# For each case, summing the SHAP values across all features and adding the expected value gives the raw margin for that case, which can then be passed through the sigmoid to recover the predicted probability:
np.isclose(model.predict(X_test, output_margin=True), explainer.expected_value + shap_df.sum(axis=1))
#True
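For reference, a minimal sketch (assuming the same model, explainer and shap_df as above) of pushing that margin through the sigmoid to recover the predicted probabilities:
from scipy.special import expit  # sigmoid / inverse-logit

margins = explainer.expected_value + shap_df.sum(axis=1)
probs = expit(margins)
np.isclose(probs, model.predict_proba(X_test)[:, 1])
# expected to be True (within floating-point tolerance)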
But why isn't the following true? Why, after the sigmoid transformation, isn't explainer.expected_value the same as y_train.mean() for XGBoost classifiers?
expit(explainer.expected_value) == y_train.mean()
#False
SHAP is guaranteed to be additive in the raw score space (logits). That additivity does not carry over to class probabilities because the sigmoid is nonlinear; to see why additivity in raw scores doesn't extend to predictions, think for a moment about why exp(x + y) != exp(x) + exp(y).
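A quick numeric illustration (the numbers are arbitrary, chosen only to show the nonlinearity):
import numpy as np
from scipy.special import expit

a, b = 0.3, -1.2                 # two arbitrary raw-score contributions
expit(a + b)                     # sigmoid of the sum: ~0.29
expit(a) + expit(b)              # sum of the sigmoids: ~0.80
np.isclose(expit(a + b), expit(a) + expit(b))
# False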
Re: I'm just keen to understand how explainer.expected_value is calculated for an XGBoost classifier. Do you happen to know?
As I stated in the comments, the expected value comes either from the model's trees or from your background data.
Let's make this reproducible:
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost
import shap
X, y = shap.datasets.adult()
X_display, y_display = shap.datasets.adult(display=True)
# create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
d_train = xgboost.DMatrix(X_train, label=y_train)
d_test = xgboost.DMatrix(X_test, label=y_test)
params = {
    "eta": 0.01,
    "objective": "binary:logistic",
    "subsample": 0.5,
    "base_score": np.mean(y_train),
    "eval_metric": "logloss",
}
model = xgboost.train(
    params,
    d_train,
    num_boost_round=5000,
    evals=[(d_test, "test")],
    verbose_eval=100,
    early_stopping_rounds=20,
)
explainer = shap.TreeExplainer(model)
ev_trees = explainer.expected_value[0]
from shap.explainers._tree import XGBTreeModelLoader

xgb_loader = XGBTreeModelLoader(model)
ts = xgb_loader.get_trees()

# sum the root-node value of each tree (its expected raw-score contribution)
v = []
for t in ts:
    v.append(t.values[0][0])
sv = sum(v)
import struct
from scipy.special import logit

# read base_score from the head of the raw model buffer
size = struct.calcsize('f')
buffer = model.save_raw().lstrip(b'binf')
v = struct.unpack('f', buffer[0:0 + size])[0]
# for objective "binary:logistic" or "reg:logistic", base_score is stored as a probability,
# so move it to the raw (logit) scale
bv = logit(v)

# expected value reconstructed from the trees: sum of root-node values plus the base score
ev_trees_raw = sv + bv
np.isclose(ev_trees, ev_trees_raw)
True
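The struct parsing above relies on the old binary layout of save_raw(), which can differ across XGBoost versions. Assuming a reasonably recent XGBoost where Booster.save_config is available, base_score can also be read from the model's JSON config (the exact JSON layout may vary by version):
import json

config = json.loads(model.save_config())
base_score = float(config["learner"]["learner_model_param"]["base_score"])
bv_alt = logit(base_score)  # again move the probability-scale base_score to the logit scale
np.isclose(bv, bv_alt)
# should be True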
background = X_train[:100]
explainer = shap.TreeExplainer(model, background)
ev_background = explainer.expected_value
Note that:
np.isclose(ev_trees, ev_background)
False
but
d_train_background = xgboost.DMatrix(background, y_train[:100])
# pred_contribs returns per-feature contributions plus a bias column; their row-sums are the raw margins
preds = model.predict(d_train_background, pred_contribs=True)
np.isclose(ev_background, preds.sum(1).mean())
True
or simply
# the background-based expected value is simply the mean raw margin over the background data
output_margin = model.predict(d_train_background, output_margin=True)
np.isclose(ev_background, output_margin.mean())
True
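Which also answers the original question: the expected value is a mean taken in raw (logit) space, and because the sigmoid is nonlinear, the sigmoid of that mean is not the mean of the per-row probabilities (which is roughly what y_train.mean() tracks). A minimal sketch, reusing output_margin from above:
from scipy.special import expit

probs = expit(output_margin)       # per-row predicted probabilities
expit(output_margin.mean())        # sigmoid of the mean margin
probs.mean()                       # mean of the per-row probabilities
np.isclose(expit(output_margin.mean()), probs.mean())
# generally False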