scikit-learn classification pipeline shap treemodel

Error in SHAP plots (tree explainer) when using an sklearn pipeline for a classification task


I am using an sklearn pipeline for a classification task, as shown below:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

import shap


# -----------------------------------------------------------------------------
# Data
# -----------------------------------------------------------------------------

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

categorical_columns = ["pclass", "sex", "embarked"]
numerical_columns = ["age", "sibsp", "parch", "fare"]

X = X[categorical_columns + numerical_columns]   # shape (1309, 7); contains NaN values
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)


# -----------------------------------------------------------------------------
# Data preprocessing
# -----------------------------------------------------------------------------


categorical_encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value", unknown_value=-1, encoded_missing_value=-1
)

numerical_imputer = SimpleImputer(strategy="mean")


preprocessing = ColumnTransformer(
    [
        ("cat", categorical_encoder, categorical_columns),   
        ("num", numerical_imputer, numerical_columns),
    ],
    verbose_feature_names_out=False,   
)


# -----------------------------------------------------------------------------
# Pipeline
# -----------------------------------------------------------------------------

rf = Pipeline(
    [
        ("preprocess", preprocessing),
        ("classifier", RandomForestClassifier(random_state=42)),
    ]
)


rf.fit(X_train, y_train)


print(f"RF train accuracy: {rf.score(X_train, y_train):.3f}")
print(f"RF test accuracy: {rf.score(X_test, y_test):.3f}")


# -----------------------------------------------------------------------------
# Shap
# -----------------------------------------------------------------------------

explainer = shap.Explainer(rf["classifier"], feature_names=rf["preprocess"].get_feature_names_out())

X_test_processed = rf['preprocess'].transform(X_test)

shap_values = explainer(X_test_processed)

However, when I try to produce SHAP plots, each of the following calls raises an error:

shap.summary_plot(shap_values, X_test_processed)

shap.summary_plot(shap_values, X_test_processed, plot_type="bar")

shap.plots.beeswarm(shap_values)

shap.plots.bar(shap_values)

What am I doing wrong? Any ideas on how to solve this would be appreciated.


Solution

  • After running your code, shap_values has shape (328, 7, 2), and the beeswarm error message is the most useful one: every plot expects a 2D array. For some reason you've got explanations for both classes instead of just the positive class (the explanations for the negative class are just the negatives of those). Putting [:, :, 1] after shap_values everywhere makes the plots display for me.

    I've tested this on Colab, but Colab doesn't readily allow Python 3.8, and hence not sklearn 1.1 either, so I had to make some modifications. Let me know if it doesn't work for you and I'll put together a closer environment locally.

    It'd still be nice to understand why shap is giving you explanations for both classes. I did notice that y has a categorical pandas dtype, but casting it to int doesn't change anything.
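To make the indexing concrete, here is a minimal NumPy-only sketch. It does not call shap at all: the array `values` is a hypothetical stand-in for the numbers inside shap_values, built on the assumption (from the answer above) that the negative-class explanations mirror the positive-class ones. The names `positive`, `values` and `positive_only` are made up for illustration.

```python
import numpy as np

# Hypothetical stand-in for the SHAP output: 328 test rows, 7 features, 2 classes.
rng = np.random.default_rng(0)
positive = rng.normal(size=(328, 7))                # attributions toward class 1
values = np.stack([-positive, positive], axis=-1)   # class 0 mirrors class 1 (assumption)

print(values.shape)         # (328, 7, 2): the 3D shape the plots reject

# Keep only the positive class to get the 2D array the plots expect.
positive_only = values[:, :, 1]
print(positive_only.shape)  # (328, 7)

# Under the mirroring assumption, the class-0 slice adds no information.
assert np.allclose(values[:, :, 0], -positive_only)

# With the real objects from the question, the same indexing would read:
#   shap.plots.beeswarm(shap_values[:, :, 1])
#   shap.plots.bar(shap_values[:, :, 1])
#   shap.summary_plot(shap_values[:, :, 1], X_test_processed)
```

Slicing with [:, :, 1] selects the last axis by position, so it assumes the positive class is the second entry of the classifier's classes_ attribute, which is worth checking before reading the plots.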