Let me preface this by saying that in the past two days I have taught myself, vaguely, how to use this program, so it is entirely possible I'm making an incredibly simple mistake, but any help is greatly appreciated. I am trying to use SHAP's waterfall plot to visualize the impact of various variables on the prediction of my XGBoost model. The model takes in 13 variables about a team's salary and then predicts the team's rank. The model works well, but when I try to use SHAP the values look wrong. As far as I understand, the f(x) in the top right of the waterfall is supposed to match the model prediction, but that is not at all the case.
Here is my code:
import shap
from joblib import dump, load
import xgboost as xgb
import pandas as pd
import numpy as np
filen = r"D:\miniconda-keep\Created Data\Done Data - Copy.csv"  # raw string so backslashes aren't treated as escapes
df = pd.read_csv(filen)  # read once instead of twice
X = df.iloc[:, 3:-3].div(10000).astype(int)
y = df.iloc[:, -1:].astype(int).subtract(1)
model = load(r"D:\miniconda-keep\Saved Will Made Files\Models\Successful_XGBoost_Model.joblib")
explainer = shap.Explainer(model)
shap_values = explainer(X)
pred = model.predict(X)
#EDIT THE VARIABLE BELOW TO LOOK AT DIFFERENT TEAMS
to_pred = 21
print(X.iloc[to_pred].subtract(X.mean(axis=0)))
print('Team:',pd.read_csv(filen).iloc[to_pred,1])
print(f"model pred {pred[to_pred]+1}")
shap.plots.waterfall(shap_values[to_pred,:,pred[to_pred]])
This is the output and waterfall plot:
Average Salary 73.311005
Highest Salary 1280.382775
Number of Homegrowns 0.000000
Salary IQR 25.593301
Salary Standard Deviation 239.521531
Average GK Salary 5.866029
Average Defender Salary 10.866029
Average Midfielder Salary 118.449761
Average Attacker Salary 137.674641
Highest Goalkeeper Salary 32.688995
dtype: float64
Team: Toronto FC
model pred 15
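For reference, the indexing `shap_values[to_pred, :, pred[to_pred]]` in the code above picks out one row's feature contributions toward its predicted class, since a multi-class Explanation is shaped (samples, features, classes). A toy numpy sketch of that slicing pattern (the array and values are made up, not my real SHAP output):

```python
import numpy as np

# Toy stand-in for a multi-class SHAP value array of shape
# (n_samples, n_features, n_classes); all numbers are made up:
vals = np.zeros((5, 4, 3))
vals[2, :, 1] = [0.4, -0.1, 0.0, 0.7]

row, predicted_class = 2, 1
# Same slicing pattern as shap_values[to_pred, :, pred[to_pred]]:
contribs = vals[row, :, predicted_class]
print(list(contribs))  # the four feature contributions toward class 1
```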
A working example from SHAP's API Examples page:
import xgboost
import shap
# train XGBoost model
X, y = shap.datasets.adult()
model = xgboost.XGBClassifier().fit(X, y)
# compute SHAP values
explainer = shap.Explainer(model, X)
shap_values = explainer(X)
shap.plots.waterfall(shap_values[0])
This outputs:
Thank you so much for any help!
The SHAP output was the log odds of the model making that prediction, per @MichaelM. This was because the model was an XGBoost classifier, not a regressor. With an XGBoost regressor, the f(x) in the waterfall does in fact match the model prediction.
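To see why the classifier's f(x) looked nothing like a rank: the base value plus the sum of the SHAP values gives the raw per-class log odds (margin), and applying softmax to those margins recovers the probabilities. A minimal numpy sketch with made-up margin numbers:

```python
import numpy as np

# Hypothetical per-class log-odds (raw margins) for one team, i.e. what
# base value + sum of SHAP values gives for an XGBClassifier:
log_odds = np.array([0.2, 1.5, -0.3])

# Softmax converts the margins into class probabilities:
probs = np.exp(log_odds) / np.exp(log_odds).sum()

print(probs.argmax())  # 1 -- index of the predicted class
print(round(probs.sum(), 6))  # 1.0 -- probabilities sum to one
```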
With an XGBoost regressor:
import xgboost
import shap
import pandas as pd
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import cross_val_score
from numpy import absolute
filen = r"D:\miniconda-keep\Created Data\Done Data - Copy.csv"  # raw string so backslashes aren't treated as escapes
X, y = pd.read_csv(filen).iloc[:, 3:-3].div(10000), pd.read_csv(filen).iloc[:, -1:]
model = xgboost.XGBRegressor()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )
model.fit(X, y)
explainer = shap.Explainer(model)
shap_values = explainer(X)
#EDIT VARIABLE BELOW TO CHANGE TEAM
row = 2
# visualize the first prediction's explanation
print(pd.read_csv(filen).iloc[row,1])
print(pd.read_csv(filen).iloc[row,-1])
shap.plots.waterfall(shap_values[row])
This outputs:
Mean MAE: 3.348 (0.495)
FC Cincinnati
1.0
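In the regressor case, the additivity property is exactly the base value plus the sum of the per-feature SHAP values; that sum is the f(x) shown in the waterfall and it matches model.predict for that row. A toy sketch with made-up numbers:

```python
import numpy as np

# Toy SHAP decomposition for one regression prediction (numbers are made up):
base_value = 10.0                            # explainer's expected value E[f(X)]
shap_contribs = np.array([2.5, -1.0, 0.75])  # per-feature SHAP values

# The waterfall's f(x) is the base value plus the contributions,
# and for an XGBRegressor it equals model.predict for that row:
f_x = base_value + shap_contribs.sum()
print(f_x)  # 12.25
```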