r regression random-forest tidymodels r-ranger

Assigning weights to variables in random forest model in R


I am trying to fit a random forest model using 'ranger' in 'tidymodels', and I get an error when assigning weights to the predictor variables. In the reproducible code below, 'Petal.Length' and 'Petal.Width' from the 'iris' dataset are the predictor variables, and I am trying to multiply them by 1 and 2 respectively, because I know that Petal.Width is twice as important as Petal.Length. 'Sepal.Length' and 'Sepal.Width' are the dependent variables. The output data frame should have two additional columns, 'Sepal.Length_pred' and 'Sepal.Width_pred', with the predicted values. The error is copied below. Help appreciated!

library(tidyverse)
library(tidymodels)
library(Metrics)

set.seed(12)
split_try<- initial_split(iris, prop = 0.7)
train_try <- split_try %>%  training() %>% na.omit()
test_try <- split_try %>%  testing()  %>% na.omit()

col_try <- names(train_try)[1:2]
ranger_try <- vector('list', length(col_try))
output_try <- vector('list', length(test_try))

wts = c(1,2)

for (i in seq_along(col_try)) {
  ranger_try[[i]] <- rand_forest(trees = 10, mode = "regression") %>%
  set_engine("ranger") %>%
  fit(as.formula(paste(col_try[i], "~ Petal.Length + Petal.Width")), data = train_try, weights = wts)
  
  output_try[[i]] <- predict(ranger_try[[i]],  test_try)
  names(output_try[[i]]) <- paste0(col_try[i],"_pred")
  test_try<- cbind(test_try, data.frame(output_try[i][[1]]))
}

Error in model.frame.default(formula, data, weights = wts) : variable lengths differ (found for '(weights)')


Solution

  • Weighting is typically done on a case (or observation, or row) basis, e.g., giving greater weight to observations with lower variance or greater weight to underrepresented subgroups.

    To apply this to the iris data, we can reshape it into a long table with tidyr, so that each petal measurement (length or width) becomes its own case. This is a little different from weighting variables (or columns).

    Then we create a weighting column. Because we are weighting rows, we need as many weight values as there are rows. This is why you got the "variable lengths differ" error: you supplied as many weight values as variables, rather than one per case.
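    The length mismatch can be reproduced without tidymodels. The sketch below uses base R's lm() purely as a stand-in, since it builds its model frame the same way and needs no extra packages; the column name case_wts is made up for illustration.

    ```r
    # Two weights for 150 rows: one weight per predictor, as in the question.
    # tryCatch() captures the error message instead of stopping the script.
    bad <- tryCatch(
      lm(Sepal.Length ~ Petal.Length + Petal.Width,
         data = iris, weights = c(1, 2)),
      error = function(e) conditionMessage(e)
    )
    print(bad)

    # One weight per row: store the weights as a column of the data
    # (all equal to 1 here, only to show the required length).
    iris_w <- transform(iris, case_wts = 1)
    ok <- lm(Sepal.Length ~ Petal.Length + Petal.Width,
             data = iris_w, weights = case_wts)
    print(length(iris_w$case_wts) == nrow(iris_w))
    ```

    The first call fails because the weights vector has 2 elements while the data has 150 rows; the second succeeds because the weights line up with the rows.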

    To be recognised as case weights, the column must be converted with 'importance_weights()'. I've added a few 'hardhat::is_case_weights()' checks to show the case weights being propagated through the workflow.

    This runs fine.

    library(tidyverse)
    library(tidymodels)
    
    df <- iris
    df <- df |> mutate(Sepal.Length = NULL, Species = NULL) |>
      pivot_longer(names_to = "Feature", cols = Petal.Length:Petal.Width) |> 
      mutate(
        wts = if_else(Feature == "Petal.Width", 2, 1),
        wts = importance_weights(wts))
    hardhat::is_case_weights(df$wts)
    
    set.seed(12)
    split_try <- initial_split(df, prop = 0.7)
    train_try <- split_try |>  training() |> na.omit()
    test_try <- split_try |>  testing()  |> na.omit()
    hardhat::is_case_weights(train_try$wts)
    
    model_try <-
      rand_forest() |>
      set_engine('ranger') |>
      set_mode('regression')
    
    wflow_try <- 
      workflow() |> 
      add_model(model_try) |> 
      add_formula(Sepal.Width ~ value) |> 
      add_case_weights(wts)
    wflow_try
    workflows:::has_case_weights(wflow_try)
    
    fit(wflow_try, data = train_try)
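
    The question also asks for a column of predicted values. One possible way to get it, sketched here under the assumption that tidymodels (with case-weight support) and ranger are installed, is to keep the fitted workflow and bind its predictions to the test set. The names fit_try, out_try and Sepal.Width_pred are illustrative and not part of the code above; trees = 10 is borrowed from the question.

    ```r
    library(tidyverse)
    library(tidymodels)

    # Long table: one row per petal measurement, with an importance weight per row.
    df <- iris |>
      select(Sepal.Width, Petal.Length, Petal.Width) |>
      pivot_longer(cols = Petal.Length:Petal.Width,
                   names_to = "Feature", values_to = "value") |>
      mutate(wts = hardhat::importance_weights(
        if_else(Feature == "Petal.Width", 2, 1)))

    set.seed(12)
    split_try <- initial_split(df, prop = 0.7)
    train_try <- training(split_try)
    test_try  <- testing(split_try)

    wflow_try <- workflow() |>
      add_model(rand_forest(trees = 10) |>
                  set_engine("ranger") |>
                  set_mode("regression")) |>
      add_formula(Sepal.Width ~ value) |>
      add_case_weights(wts)

    # Store the fitted workflow so it can be reused for prediction.
    fit_try <- fit(wflow_try, data = train_try)

    # predict() returns a tibble with a .pred column; rename it and attach it
    # to the test rows.
    out_try <- bind_cols(
      test_try,
      predict(fit_try, new_data = test_try) |>
        rename(Sepal.Width_pred = .pred)
    )
    head(out_try)
    ```

    The same steps could be repeated with Sepal.Length as the outcome to obtain a 'Sepal.Length_pred' column. The case weights only influence fitting; they play no role when predicting.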