Several publications highlight that variable importance scores derived from machine learning models may be biased. A recent study by Loh and Zou (2021) shows that ranger's permutation-based variable importance scores produce unbiased results.
I am using tidymodels with the ranger engine to estimate a random forest model. How can I get ranger's variable importance scores from the resulting fit? How do they differ from the variable importance scores returned by vip? As I understand it, the vip output in the example below is the model-specific Gini importance of the random forest.
library(tidymodels)
library(vip)
aq <- na.omit(airquality)
model_rf <-
  rand_forest(mode = "regression") %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(Ozone ~ ., data = aq)
# variable importance
vip::vi(model_rf)
I think you want to change the value of the importance argument to get the unbiased estimates. ranger has a function to extract the importance scores, and the model-specific method in the vip package returns the same values:
library(tidymodels)
library(vip)
#>
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#>
#> vi
aq <- na.omit(airquality)
set.seed(1)
model_rf <-
  rand_forest(mode = "regression") %>%
  set_engine("ranger", importance = "impurity_corrected") %>%
  fit(Ozone ~ ., data = aq)
model_rf %>%
  extract_fit_engine() %>%
  ranger::importance() %>%
  sort(decreasing = TRUE)
#>      Temp      Wind   Solar.R     Month       Day
#> 27919.050 23028.379  6830.772  3077.430  1597.355
# the same as using ranger directly
vip::vi(model_rf)
#> # A tibble: 5 × 2
#>   Variable Importance
#>   <chr>         <dbl>
#> 1 Temp         27919.
#> 2 Wind         23028.
#> 3 Solar.R       6831.
#> 4 Month         3077.
#> 5 Day           1597.
Created on 2022-11-16 by the reprex package (v2.0.1)
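If it is the permutation-based scores from the question that you are after, the same extraction pattern should apply with importance = "permutation". The following is only a minimal sketch on the same airquality data; the object names fit_perm, perm_scores and vi_scores are made up for illustration, and the scores depend on the seed, so no output is shown.

```r
library(tidymodels)

aq <- na.omit(airquality)

set.seed(1)
fit_perm <-
  rand_forest(mode = "regression") %>%
  # ask ranger to compute permutation importance while fitting
  set_engine("ranger", importance = "permutation") %>%
  fit(Ozone ~ ., data = aq)

# named numeric vector taken from the underlying ranger object
perm_scores <-
  fit_perm %>%
  extract_fit_engine() %>%
  ranger::importance()

sort(perm_scores, decreasing = TRUE)

# vip's model-specific method reads the scores stored in the fit,
# so the two are expected to agree when matched by variable name
vi_scores <- vip::vi(fit_perm)
all.equal(vi_scores$Importance, unname(perm_scores[vi_scores$Variable]))
```

In other words, the kind of importance is decided by the importance argument passed to set_engine(); ranger::importance() and vip::vi() both report whatever ranger stored for that setting.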