Tags: r, cross-validation, r-caret, r-ranger

Cross-Validation with the R ranger Library


Hello, I have the following ranger model:

library(ranger)

X <- train_df[, -1]
y <- train_df$Price

rf_model <- ranger(Price ~ ., data = train_df,
                   mtry = 11, splitrule = "extratrees",
                   min.node.size = 1, num.trees = 100)

I am trying to accomplish two things:

  1. Get an average performance metric by cross-validating across non-intersecting data sets, so that the accuracy estimate stays stable even when the seed value changes.
  2. Set up cross-validation to find the optimal combination of mtry and num.trees.

What I have tried:

**The following worked for optimizing mtry, splitrule, and min.node.size, but I cannot add the number of trees to the search, because doing so produces an error.**

# define the parameter grid to search over
param_grid <- expand.grid(mtry = c(1:ncol(X)),
                          splitrule = c("variance", "extratrees", "maxstat"),
                          min.node.size = c(1, 5, 10))

# set up the cross-validation scheme
cv_scheme <- trainControl(method = "cv",
                          number = 5,
                          verboseIter = TRUE)

# perform the grid search using caret
rf_model <- train(x = X,
                  y = y,
                  method = "ranger",
                  trControl = cv_scheme,
                  tuneGrid = param_grid)

# view the best parameter values
rf_model$bestTune

Solution

  • One easy way to do it is to add a num.trees argument in train (caret forwards extra arguments on to ranger) and iterate over that argument.
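    The error you get when num.trees is in tuneGrid is expected: caret's built-in "ranger" method only tunes mtry, splitrule and min.node.size, and any other ranger argument has to be passed through train's `...`. You can check which parameters a method exposes:

    library(caret)
    # tunable parameters for method = "ranger": mtry, splitrule and
    # min.node.size -- num.trees is not among them, hence the error
    modelLookup("ranger")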

    The other way is to create your own customized model; see the caret chapter Using Your Own Model.

    There is an RPubs article by Pham Dinh Khanh demonstrating that here. A sketch of such a custom model follows, and after it a worked example of the simpler num.trees iteration.
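    A minimal sketch of what such a custom model could look like for the regression case in the question (component names follow the caret chapter; the rangerTrees name and the grid values are placeholders, not a definitive implementation):

    rangerTrees <- list(
      library    = "ranger",
      type       = "Regression",
      # num.trees becomes a fourth tuning parameter
      parameters = data.frame(
        parameter = c("mtry", "splitrule", "min.node.size", "num.trees"),
        class     = c("numeric", "character", "numeric", "numeric"),
        label     = c("mtry", "splitrule", "min.node.size", "num.trees")),
      grid = function(x, y, len = NULL, search = "grid")
        expand.grid(mtry          = seq_len(ncol(x)),
                    splitrule     = c("variance", "extratrees"),
                    min.node.size = c(1, 5),
                    num.trees     = c(100, 500)),   # placeholder values
      fit = function(x, y, wts, param, lev, last, weights, classProbs, ...) {
        dat <- as.data.frame(x)
        dat$.outcome <- y
        ranger::ranger(.outcome ~ ., data = dat,
                       mtry          = param$mtry,
                       splitrule     = as.character(param$splitrule),
                       min.node.size = param$min.node.size,
                       num.trees     = param$num.trees)
      },
      predict = function(modelFit, newdata, preProc = NULL, submodels = NULL)
        predict(modelFit, data = as.data.frame(newdata))$predictions,
      prob = NULL,                      # regression: no class probabilities
      sort = function(x) x[order(x$num.trees, x$mtry), ]
    )

    # used like any built-in method; the grid may now include num.trees
    rf_model <- train(x = X, y = y,
                      method    = rangerTrees,
                      trControl = cv_scheme,
                      tuneGrid  = rangerTrees$grid(X, y))

    The worked example below takes the simpler route and iterates num.trees outside train: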

    library(caret)
    library(mlbench)
    library(ranger)
    data(PimaIndiansDiabetes)
    x <- PimaIndiansDiabetes[, -ncol(PimaIndiansDiabetes)]
    y <- PimaIndiansDiabetes[, ncol(PimaIndiansDiabetes)]

    # note: "variance" is a regression split rule, so it yields NaN
    # metrics on this classification outcome (see the output below)
    param_grid <- expand.grid(mtry = c(1:4),
                              splitrule = c("variance", "extratrees"),
                              min.node.size = c(1, 5))
    cv_scheme <- trainControl(method = "cv",
                              number = 5,
                              verboseIter = FALSE)
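    # note (question's point 1): for an accuracy estimate that is more
    # stable across seed values, repeated cross-validation averages over
    # several different fold splits, e.g.
    # cv_scheme <- trainControl(method = "repeatedcv", number = 5, repeats = 5)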
    # fit one cross-validated grid search per forest size, resetting the
    # seed first so every run is evaluated on identical folds
    models <- list()
    for (ntree in c(4, 100)) {
      set.seed(123)
      rf_model <- train(x = x,
                        y = y,
                        method = "ranger",
                        trControl = cv_scheme,
                        tuneGrid = param_grid,
                        num.trees = ntree)
      name <- paste0(ntree, "_tr_model")
      models[[name]] <- rf_model
    }
    
    models[["4_tr_model"]]
    #> Random Forest 
    #> 
    #> 768 samples
    #>   8 predictor
    #>   2 classes: 'neg', 'pos' 
    #> 
    #> No pre-processing
    #> Resampling: Cross-Validated (5 fold) 
    #> Summary of sample sizes: 614, 615, 614, 615, 614 
    #> Resampling results across tuning parameters:
    #> 
    #>   mtry  splitrule   min.node.size  Accuracy   Kappa    
    #>   1     variance    1                    NaN        NaN
    #>   1     variance    5                    NaN        NaN
    #>   1     extratrees  1              0.6808675  0.2662428
    #>   1     extratrees  5              0.6783125  0.2618862
    ...
    
    models[["100_tr_model"]]
    #> Random Forest 
    ...
    #> 
    #>   mtry  splitrule   min.node.size  Accuracy   Kappa    
    #>   1     variance    1                    NaN        NaN
    #>   1     variance    5                    NaN        NaN
    #>   1     extratrees  1              0.7473559  0.3881530
    #>   1     extratrees  5              0.7564808  0.4112127
    ...
    
    

    Created on 2023-04-19 with reprex v2.0.2
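
    As a possible follow-up sketch (not part of the reprex above): since set.seed(123) runs before each call to train, both models are evaluated on identical folds, so their resampling distributions can be compared directly with caret's resamples:

    # side-by-side summary of per-fold Accuracy/Kappa for both forest sizes
    summary(resamples(models))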