Several publications highlight that variable importance scores derived from machine learning models may be biased. A recent study by Loh and Zou (2021) shows that ranger permutation-based variable importance scores produce unbiased results.
I am using tidymodels with a ranger engine to estimate a random forest model. How can I get the ranger variable importance scores from the resulting fit? How do they differ from the variable importance scores from vip? From my understanding, the vip output in the example below is the random forest model-specific Gini importance.
library(tidymodels)
library(vip)
aq <- na.omit(airquality)
model_rf <-
  rand_forest(mode = "regression") %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(Ozone ~ ., data = aq)
# variable importance (vi() is exported, so `::` is enough)
vip::vi(model_rf)
I think you want to change the value of the importance argument to get the unbiased estimates. ranger has a function to get the importance scores, and the model-specific method in the vip package returns the same values:
library(tidymodels)
library(vip)
#>
#> Attaching package: 'vip'
#> The following object is masked from 'package:utils':
#>
#> vi
aq <- na.omit(airquality)
set.seed(1)
model_rf <-
  rand_forest(mode = "regression") %>%
  set_engine("ranger", importance = "impurity_corrected") %>%
  fit(Ozone ~ ., data = aq)
model_rf %>%
  extract_fit_engine() %>%
  ranger::importance() %>%
  sort(decreasing = TRUE)
#>      Temp      Wind   Solar.R     Month       Day
#> 27919.050 23028.379  6830.772  3077.430  1597.355
# the same as using ranger directly
vip::vi(model_rf)
#> # A tibble: 5 × 2
#>   Variable Importance
#>   <chr>         <dbl>
#> 1 Temp         27919.
#> 2 Wind         23028.
#> 3 Solar.R       6831.
#> 4 Month         3077.
#> 5 Day           1597.
Created on 2022-11-16 by the reprex package (v2.0.1)
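To see how much the choice of the importance argument matters, the scores from each mode can be put side by side. The following is only a minimal sketch: it assumes ranger is called directly rather than through tidymodels, and the names modes and scores are illustrative, not part of the original example. The exact values depend on the seed and the ranger version.

```r
library(ranger)

aq <- na.omit(airquality)

# Importance modes accepted by ranger's `importance` argument
modes <- c("impurity", "impurity_corrected", "permutation")

# Fit one forest per mode and collect the importance vector from each
scores <- sapply(modes, function(m) {
  fit <- ranger(Ozone ~ ., data = aq, importance = m, seed = 1)
  importance(fit)
})

# One row per predictor, one column per importance mode
scores
```

The columns are on different scales, so it is the ranking of the predictors within each column that is comparable, not the raw numbers across columns.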