I am running a random forest regression (RFR) task and I want to apply the drop-column importance strategy. The basic idea of this strategy is to:
get a baseline performance score as with permutation importance, but then drop a column entirely, retrain the model, and recompute the performance score. The importance value of a feature is then the difference between the baseline and the score from the model missing that feature.
I found this strategy here and here.
Using the ranger package, how can I implement this strategy so that I end up with a final model that uses only the most important predictors (as ranked by drop-column importance), and ideally print those variables?
library(ranger)
train.idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]
rg.iris <- ranger(Species ~ .,
                  data = iris.train,
                  num.trees = 101,
                  importance = "permutation")
Windows 11, R 4.3.3, RStudio 2023.12.1 Build 402.
I think you could write a loop to do this, but you'd need to give some thought to which metric you use to assess each model. Here's some code that uses accuracy:
library(ranger)
library(data.table)
data(iris)
setDT(iris)
set.seed(1234)
# Function to assess performance, here based on accuracy.
# For a regression forest, swap this for a metric such as RMSE.
performance <- function(model, data) {
  predictions <- predict(model, data.frame(data))
  mean(data$Species == predictions$predictions)
}
# Assess performance of the baseline model (all features).
# Note: the score is computed on the training data, so it is optimistic.
baseline_model <- ranger(Species ~ ., data = iris)
baseline_performance <- performance(baseline_model, iris)
# Drop each feature in turn, retrain, and record the loss in performance
features <- setdiff(names(iris), "Species")
importance <- setNames(numeric(length(features)), features)
for (feature in features) {
  reduced <- iris[, !(names(iris) %in% feature), with = FALSE]
  model <- ranger(Species ~ ., data = reduced)
  importance[feature] <- baseline_performance - performance(model, reduced)
}
# Named vector: baseline accuracy minus accuracy without each feature
importance
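To get from the importance values to a final model, one option is to rank the features and refit on the top few. The sketch below is only one way to do it, and it makes a few assumptions that are not in the code above: it scores each model with ranger's out-of-bag error (`prediction.error`) instead of accuracy on the training data, it uses a hypothetical helper `oob_error()`, and it keeps an arbitrary `top_k = 2` features. Importance is computed as the increase in OOB error when a feature is dropped, which is the same idea as baseline score minus reduced score.

```r
library(ranger)

set.seed(1234)
data(iris)  # reloads iris as a plain data.frame
target <- "Species"
features <- setdiff(names(iris), target)

# Hypothetical helper: fit a forest on the given predictors and return
# ranger's out-of-bag prediction error (misclassification rate here,
# mean squared error for a regression forest).
oob_error <- function(vars) {
  fit <- ranger(reformulate(vars, response = target),
                data = iris, num.trees = 500)
  fit$prediction.error
}

# Baseline: all predictors
baseline_error <- oob_error(features)

# Drop-column importance: increase in OOB error when a feature is removed
drop_importance <- sapply(features, function(f) {
  oob_error(setdiff(features, f)) - baseline_error
})
drop_importance <- sort(drop_importance, decreasing = TRUE)
print(drop_importance)

# Keep the top_k features (top_k = 2 is an arbitrary choice for illustration)
top_k <- 2
selected <- names(drop_importance)[seq_len(top_k)]
print(selected)

# Final model refit on the selected predictors only
final_model <- ranger(reformulate(selected, response = target),
                      data = iris, num.trees = 500)
print(final_model$prediction.error)
```

With only a handful of correlated predictors, as in iris, the differences can be small or even negative, so it is probably worth repeating the loop over several seeds before trusting the ranking.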