r machine-learning random-forest shap

Bee swarm plot with SHAP values for Random forest


I am trying to visualize SHAP values for a random forest model.

I am working with this piece of code:

# Load necessary libraries
library(randomForest)
library(DALEX)
library(beeswarm)

data <- turismo_rf

# Split the data into features and target
features <- data[, -which(names(data) %in% "Clus.1")]
target <- data$Clus.1

# Train a random forest model
rf_model <- randomForest(features, target)

# Create an explainer object
explainer <- DALEX::explain(rf_model, data = features, y = target)

# Compute SHAP values
shapley_values <- DALEX::predict_parts(explainer, new_observation = features)

# Plot bee swarm
beeswarm(shapley_values$shap_1)

I have tried using the beeswarm package, but I always get this error:

beeswarm(shapley_values$shap_1)
Error in rep(nms, sapply(x, length)) : invalid 'times' argument

Can you please tell me what is wrong with the beeswarm call?

The output I want to produce

And this is the output I get if I just use plot(shapley_values)


Solution

  • {DALEX} does not support plotting or working with SHAP values of multiple observations at once. Plotting SHAP beeswarm plots is easy with {shapviz}. Calculating SHAP values can be done with different packages, e.g., {kernelshap}, {fastshap}, or {treeshap}.

    Note that random forests are among the slowest models to explain with SHAP, because the trees are deep and predictions are very slow.

    Kernel SHAP or permutation SHAP

    library(randomForest)
    library(kernelshap)  # or library(treeshap)
    library(shapviz)
    
    fit <- randomForest(Sepal.Length ~ ., data = iris)
    
    xvars <- setdiff(colnames(iris), "Sepal.Length")
    
    # Or kernelshap() if length(xvars) is >10. Subsample bg_X to 100-500 rows
    shap_values <- permshap(fit, X = iris, bg_X = iris, feature_names = xvars)
    shap_values <- shapviz(shap_values)
    sv_importance(shap_values, kind = "bee")
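
The comment above suggests subsampling the background data. A minimal sketch of that variant follows, still using iris as a stand-in for the real data. The seed, the background size of 100 rows, and the object names `bg_X`, `ks`, and `sv` are illustrative assumptions, not part of the original answer.

```r
library(randomForest)
library(kernelshap)
library(shapviz)

set.seed(1)  # assumption: fixed seed so the forest and the subsample are reproducible

fit <- randomForest(Sepal.Length ~ ., data = iris)
xvars <- setdiff(colnames(iris), "Sepal.Length")

# Background data: a random subsample of rows (100 is an arbitrary choice
# within the 100-500 range mentioned above)
bg_X <- iris[sample(nrow(iris), 100), ]

# Kernel SHAP values for every row of iris, relative to the background sample
ks <- kernelshap(fit, X = iris, bg_X = bg_X, feature_names = xvars, verbose = FALSE)

# Convert to a shapviz object and draw the beeswarm plot
sv <- shapviz(ks)
sv_importance(sv, kind = "bee")
```

A smaller background set cuts the number of model predictions roughly in proportion, which matters for a random forest because each prediction is slow.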
    

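
For comparison, the {DALEX} route from the question can be used for one observation at a time. The sketch below is an illustration under assumptions: it uses iris instead of the asker's data, explains only the first row, and sets `B = 10` random orderings arbitrarily. Note that `predict_parts()` defaults to a break-down explanation, so `type = "shap"` has to be requested explicitly.

```r
library(randomForest)
library(DALEX)

set.seed(1)  # assumption: fixed seed for reproducibility

fit <- randomForest(Sepal.Length ~ ., data = iris)
xvars <- setdiff(colnames(iris), "Sepal.Length")

# Explainer built on the feature columns only
explainer <- DALEX::explain(
  fit,
  data = iris[xvars],
  y = iris$Sepal.Length,
  verbose = FALSE
)

# SHAP-style attributions for a SINGLE row (here: the first one)
shap_one <- DALEX::predict_parts(
  explainer,
  new_observation = iris[1, xvars],
  type = "shap",
  B = 10  # number of random feature orderings, chosen arbitrarily
)

# Contribution plot for that one observation, not a beeswarm
plot(shap_one)
```

This gives a per-observation contribution plot. A beeswarm over all rows still needs a matrix of SHAP values, as produced in the {kernelshap} example.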