Extracting estimates with ranger decision trees


I am getting the error message "Error: No tidy method for objects of class ranger" when trying to extract the estimates for a regression model built with the ranger package in R.

Here is my code:


# libraries
library(tidymodels)
library(textrecipes)
library(LiblineaR)
library(ranger)
library(tidytext)

# create the recipe
comments.rec <- recipe(year ~ comments, data = oa.comments) %>%
  step_tokenize(comments, token = "ngrams", options = list(n = 2, n_min = 1)) %>%
  step_tokenfilter(comments, max_tokens = 1e3) %>%
  step_stopwords(comments, stopword_source = "stopwords-iso") %>%
  step_tfidf(comments) %>%
  step_normalize(all_predictors())

# workflow with recipe
comments.wf <- workflow() %>%
  add_recipe(comments.rec)

# create the regression model using support vector machine
svm.spec <- svm_linear() %>%
  set_engine("LiblineaR") %>%
  set_mode("regression")

svm.fit <- comments.wf %>%
  add_model(svm.spec) %>%
  fit(data = oa.comments)

# extract the estimates for the support vector machine model
svm.fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)


Below is the table of estimates for each tokenized term in the data set (this is a dirty data set for demo purposes):

   term                     estimate
   <chr>                       <dbl>
 1 Bias                     2015.   
 2 tfidf_comments_2021         0.877
 3 tfidf_comments_2019         0.851
 4 tfidf_comments_2020         0.712
 5 tfidf_comments_2018         0.641
 6 tfidf_comments_https        0.596
 7 tfidf_comments_plan s       0.462
 8 tfidf_comments_plan         0.417
 9 tfidf_comments_2017         0.399
10 tfidf_comments_libraries    0.286

However, when using the ranger engine to fit a random forest regression model, I have no such luck and get the error message above:

# create the regression model using random forests
rf.spec <- rand_forest(trees = 50) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf.fit <- comments.wf %>%
  add_model(rf.spec) %>%
  fit(data = oa.comments)

# extract the estimates for the random forests model
rf.fit %>%
  pull_workflow_fit() %>%
  tidy() %>%
  arrange(-estimate)


Solution

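  • To put this back to you in a simpler form that I think highlights the issue: if you had a decision tree model, how would you produce coefficients for the columns in the dataset, and what would they mean? A linear SVM has one weight per term, which is what tidy() returns. A random forest is an ensemble of trees with no such per-term coefficients, which is why there is no tidy method for ranger objects.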

    I think what you are looking for here is some form of attribution to each column. There are tools to do this built into tidymodels, but you should read up on what they actually report.
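To illustrate why it matters what the score is reporting, here is a minimal sketch that fits ranger directly on the built-in mtcars data (a stand-in, since oa.comments is not available here) with two different importance modes. For regression, "impurity" measures the reduction in response variance at splits on a variable, while "permutation" measures how much the out-of-bag prediction error grows when that variable is shuffled. The object names are illustrative only.

```r
library(ranger)

set.seed(42)

# Same data and forest size, two different definitions of "importance"
rf_impurity <- ranger(mpg ~ ., data = mtcars, num.trees = 50,
                      importance = "impurity")
rf_permutation <- ranger(mpg ~ ., data = mtcars, num.trees = 50,
                         importance = "permutation")

# Line the two score vectors up by variable name
terms <- names(rf_impurity$variable.importance)
importance_compare <- data.frame(
  term        = terms,
  impurity    = unname(rf_impurity$variable.importance[terms]),
  permutation = unname(rf_permutation$variable.importance[terms])
)

# Sort by impurity importance, largest first
importance_compare <- importance_compare[order(-importance_compare$impurity), ]
importance_compare
```

The two columns are on different scales and need not rank the variables identically, and neither is a coefficient in the sense of the SVM estimates.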

    In your case, you can get a basic idea of what those numbers look like by using the vip package, though the resulting numbers are definitely not directly comparable to your SVM ones.

    install.packages("vip")  # only needed once
    library(vip)

    # plot relative importance scores from the fitted ranger model
    rf.fit %>%
      pull_workflow_fit() %>%
      vip(geom = "point") +
      labs(title = "Random forest variable importance")
    

    This produces a plot of relative importance scores. To get the numbers themselves:

    rf.fit %>%
      pull_workflow_fit() %>%
      vi()
    

    tidymodels has a decent walkthrough of this on its tutorial page 'A case study', but given that you have a model that can estimate importance scores, you should be good to go.

    Edit: if you haven't already done so, you will need to refit your model with an importance argument passed in the set_engine step of your code, e.g. set_engine("ranger", importance = "impurity") or importance = "permutation". This tells ranger which kind of importance scores you want and how they should be computed; by default it computes none.
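As a sketch of the refit described in the edit, the example below passes importance = "impurity" through set_engine() and then pulls the scores out with vi(). It uses the built-in mtcars data and a formula preprocessor so that it runs on its own; the names rf_spec, rf_wf, rf_fit and rf_importance are illustrative, and extract_fit_parsnip() is assumed to be available as the newer name for pull_workflow_fit() in recent versions of workflows.

```r
library(tidymodels)
library(vip)

set.seed(123)

# importance = "impurity" is handed on to ranger::ranger() by set_engine()
rf_spec <- rand_forest(trees = 50) %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")

# Stand-in workflow; in the question this would be comments.wf
rf_wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(rf_spec)

rf_fit <- rf_wf %>%
  fit(data = mtcars)

# Tibble of importance scores, one row per predictor
rf_importance <- rf_fit %>%
  extract_fit_parsnip() %>%
  vi()

rf_importance
```

For the model in the question, the only change needed should be the extra importance argument in the set_engine() call of rf.spec, followed by refitting rf.fit.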