r regression random-forest feature-selection r-ranger

Random forest regression: drop-column importance


I am running a random forest regression (RFR) task and want to apply the drop-column importance strategy. The basic idea of this strategy is:

to get a baseline performance score as with permutation importance, but then drop a column entirely, retrain the model, and recompute the performance score. The importance value of a feature is then the difference between the baseline and the score from the model missing that feature.

I found this strategy here and here.

Using the ranger package, how can I implement this strategy so that I end up with a final model containing only the most important predictors (as ranked by drop-column importance), and ideally print those variables?

library(ranger)

train.idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]

rg.iris <- ranger(Species ~ ., 
              data = iris.train, 
              num.trees = 101, 
              importance = "permutation")

Windows 11, R 4.3.3, RStudio 2023.12.1 Build 402.


Solution

  • I think you could write a loop to do this, but you'd need to decide which metric to use to assess each model. Here's some code that scores each model by accuracy:

    library(ranger)
    library(data.table)
    
    data(iris)
    setDT(iris)
    set.seed(1234)
    
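    # Function to assess performance, here based on accuracy
    # (share of correctly classified rows). Note: the data passed in below
    # is the same data the model was trained on, so the score is in-sample.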
    performance <- function(model, data) {
      predictions <- predict(model, data.frame(data))
      return(mean(data$Species == predictions$predictions))
    }
    
    # Assess performance of baseline model
    baseline_model <- ranger(Species ~ ., data = iris)
    baseline_performance <- performance(baseline_model, iris)
    
    # Drop one feature at a time, retrain, and record the change in accuracy
    importance <- c()
    for (feature in setdiff(names(iris), "Species")) {
      # Data without the current feature (data.table column selection)
      reduced <- iris[, !(names(iris) %in% feature), with = FALSE]
      model <- ranger(Species ~ ., data = reduced)
      performance_without_feature <- performance(model, reduced)
      # Positive value: accuracy fell when the feature was removed
      importance[feature] <- baseline_performance - performance_without_feature
    }
    
    importance
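
Since the loop above scores each model on the rows it was trained on, and the question is about regression, here is a rough sketch of one possible variation: it uses ranger's out-of-bag `prediction.error` (MSE for regression) as the score and then refits a final model on the top-ranked predictors. The helper name `drop_column_importance`, the use of `mtcars` with `mpg` as the response, and the choice to keep the top 3 features are all assumptions for illustration. It also assumes a plain data.frame rather than a data.table.

```r
library(ranger)

set.seed(1234)

# Hypothetical helper (not part of ranger): drop-column importance based on
# the out-of-bag (OOB) prediction error reported by ranger.
# Assumes `data` is a plain data.frame and `response` is a column name.
drop_column_importance <- function(data, response, ...) {
  features <- setdiff(names(data), response)

  # Baseline: OOB error of the model trained on all features
  baseline_error <- ranger(dependent.variable.name = response,
                           data = data, ...)$prediction.error

  sapply(features, function(feature) {
    # Retrain without the current feature
    reduced <- data[, setdiff(names(data), feature), drop = FALSE]
    reduced_error <- ranger(dependent.variable.name = response,
                            data = reduced, ...)$prediction.error
    # Positive value: the OOB error went up when the feature was removed
    reduced_error - baseline_error
  })
}

# Regression example: predict mpg from the other mtcars columns
importance <- drop_column_importance(mtcars, "mpg", num.trees = 500)
print(sort(importance, decreasing = TRUE))

# Keep the 3 highest-ranked features (an arbitrary cut-off) and refit
selected <- head(names(sort(importance, decreasing = TRUE)), 3)
final_model <- ranger(dependent.variable.name = "mpg",
                      data = mtcars[, c("mpg", selected), drop = FALSE],
                      num.trees = 500)

# Print the variables used by the final model
print(final_model$forest$independent.variable.names)
```

Each retrained forest is random, so small differences in importance are mostly noise. Averaging over several seeds or using more trees should make the ranking more stable. With strongly correlated predictors, dropping one column may barely change the error because the others can stand in for it.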