r machine-learning random-forest shap

Bee swarm plot with SHAP values for Random forest


In the code below, I am trying to visualize the SHAP values of a random forest model.

The code is written in R:

# Load necessary libraries
library(randomForest)
library(DALEX)
library(beeswarm)

data <- my_database

# Splitting the data into features and target
features <- data[, -which(names(data) %in% "Clus.1")]
target <- data$Clus.1

# Train a random forest model
rf_model <- randomForest(features, target)

# Create an explainer object
explainer <- DALEX::explain(rf_model, data = features, y = target)

# Compute SHAP values
shapley_values <- DALEX::predict_parts(explainer, new_observation = features)

# Plot bee swarm
beeswarm(shapley_values$shap_1)

I tried the beeswarm package and ended up with this error:

beeswarm(shapley_values$shap_1)
Error in rep(nms, sapply(x, length)) : invalid 'times' argument

Can you please tell me what is wrong with my use of beeswarm, or suggest other similar packages?

Output of what I am trying to do

And this is the output I am getting if I use plot(shapley_values)


Solution

  • {DALEX} does not support plotting or working with SHAP values of multiple observations; predict_parts() explains one observation at a time. SHAP beeswarm plots are easy to draw with {shapviz}. The SHAP values themselves can be calculated with different packages, e.g., {kernelshap}, {fastshap}, or {treeshap}.

    Note that random forests are among the worst models for SHAP: the trees are deep and predictions are very slow, so computing SHAP values takes a long time.

    Kernel SHAP or permutation SHAP

    library(randomForest)
    library(kernelshap)  # or library(treeshap)
    library(shapviz)
    
    fit <- randomForest(Sepal.Length ~ ., data = iris)
    
    xvars <- setdiff(colnames(iris), "Sepal.Length")
    
    # Use kernelshap() instead if length(xvars) is > 10.
    # For larger data, subsample bg_X to 100-500 rows.
    shap_values <- permshap(fit, X = iris, bg_X = iris, feature_names = xvars)
    shap_values <- shapviz(shap_values)
    sv_importance(shap_values, kind = "bee")
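
The comment about subsampling bg_X can be made concrete as follows. This is only a minimal sketch on the same iris model: the sample sizes (50 background rows, 100 explained rows) and the names bg_X and X_explain are illustrative assumptions, chosen small because iris has only 150 rows. On real data you would follow the 100-500 row guidance for the background set.

```r
library(randomForest)
library(kernelshap)
library(shapviz)

set.seed(1)  # make the random subsamples reproducible

fit <- randomForest(Sepal.Length ~ ., data = iris)
xvars <- setdiff(colnames(iris), "Sepal.Length")

# Background data: a random subsample of the training data (assumed size: 50)
bg_X <- iris[sample(nrow(iris), 50), ]

# Rows to explain: also subsampled to keep run time down (assumed size: 100)
X_explain <- iris[sample(nrow(iris), 100), ]

# Kernel SHAP restricted to the model features
ks <- kernelshap(fit, X = X_explain, bg_X = bg_X, feature_names = xvars)

# Convert to a shapviz object and draw the beeswarm plot
sv <- shapviz(ks)
sv_importance(sv, kind = "bee")
```

Run time grows with the number of explained rows times the number of background rows, which is why both are subsampled here.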
    
