Tags: r, random-forest, r-caret, interpretation

Random Forest - how can mtry be larger than the total number of independent variables?


1) I ran a regression Random Forest on a training set of 185 rows with 4 independent variables. Two are categorical, with 3 levels and 13 levels respectively; the other two are continuous numeric variables.

I ran RF with 10-fold cross-validation repeated 4 times. (I didn't scale the dependent variable, which is why the RMSE is so large.)

I guess the reason mtry is bigger than 4 is that the categorical variables have 3 + 13 = 16 levels in total. But if so, why doesn't it account for the numeric variables?

185 samples
4 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times) 
Summary of sample sizes: 168, 165, 166, 167, 166, 167, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE    
   2    16764183  0.7843863  9267902
   9     9451598  0.8615202  3977457
  16     9639984  0.8586409  3813891

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.

Please help me understand mtry.

2) Also, why does the sample size change from fold to fold?

sample sizes: 168, 165, 166, 167, 166, 167

Thank you so much.


Solution

  • You are correct that there are 16 columns to sample from, hence the maximum for mtry is 16, though not because 3 + 13 = 16. When you use the formula interface, caret expands each factor into dummy variables, so you get (3 - 1) + (13 - 1) = 14 dummy columns plus the 2 numeric variables, for 16 predictor columns in total.

    The values caret chooses are driven by two things. In train, there is a tuneLength option, which defaults to 3:

    tuneLength = ifelse(trControl$method == "none", 1, 3)
    

    This means it tests three values of the tuning parameter. For randomForest the tuning parameter is mtry, and its default grid function is:

    caret::getModelInfo("rf")[[1]]$grid
    #> function (x, y, len = NULL, search = "grid") 
    #> {
    #>     if (search == "grid") {
    #>         out <- data.frame(mtry = caret::var_seq(p = ncol(x), 
    #>             classification = is.factor(y), len = len))
    #>     }
    #>     else {
    #>         out <- data.frame(mtry = unique(sample(1:ncol(x), size = len, 
    #>             replace = TRUE)))
    #>     }
    #>     out
    #> }
    

    Created on 2022-07-01 by the reprex package (v2.0.1)

    Since you have 16 predictor columns, the grid becomes:

    var_seq(16,len=3)
    [1]  2  9 16
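
    You can verify the 16-column expansion directly with model.matrix, which is what the formula interface uses under the hood. This is only a sketch on data shaped like yours; the column names here are invented for the example:

    ```r
    # Data with the same shape as the question: one 3-level factor,
    # one 13-level factor, and 2 numeric columns (names are made up).
    df <- data.frame(
      y   = rnorm(185),
      f3  = factor(sample(letters[1:3],  185, replace = TRUE), levels = letters[1:3]),
      f13 = factor(sample(letters[1:13], 185, replace = TRUE), levels = letters[1:13]),
      n1  = runif(185),
      n2  = rnorm(185)
    )

    # Each k-level factor becomes k - 1 dummy columns.
    mm <- model.matrix(y ~ ., data = df)[, -1]  # drop the intercept column
    ncol(mm)  # (3 - 1) + (13 - 1) + 2 = 16
    ```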
    

    You can test mtry values of your own choosing by supplying a tuneGrid:

    library(caret)
    trCtrl = trainControl(method = "repeatedcv", repeats = 4, number = 10)
    # test mtry = 2, 4, 6, ..., 16
    trg = data.frame(mtry = seq(2, 16, by = 2))
    # some random data for the example
    df = data.frame(y = rnorm(200),
                    x1 = sample(letters[1:13], 200, replace = TRUE),
                    x2 = sample(LETTERS[1:3], 200, replace = TRUE),
                    x3 = rpois(200, 10),
                    x4 = runif(200))
    
    # fit
    mdl = train(y ~ ., data = df, tuneGrid = trg, trControl = trCtrl)
    
    Random Forest 
    
    200 samples
      4 predictor
    
    No pre-processing
    Resampling: Cross-Validated (10 fold, repeated 4 times) 
    Summary of sample sizes: 180, 180, 180, 180, 180, 180, ... 
    Resampling results across tuning parameters:
    
      mtry  RMSE      Rsquared    MAE      
       2    1.120216  0.04448700  0.8978851
       4    1.157185  0.04424401  0.9275939
       6    1.172316  0.04902991  0.9371778
       8    1.186861  0.05276752  0.9485516
      10    1.193595  0.05490291  0.9543479
      12    1.200837  0.05608624  0.9574420
      14    1.205663  0.05374614  0.9621094
      16    1.210783  0.05537412  0.9665665
    
    RMSE was used to select the optimal model using the smallest value.
    The final value used for the model was mtry = 2.
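
  • Regarding your second question: the numbers shown are the training-set sizes for each resample, and 185 is not divisible by 10, so the folds cannot all be the same size. In addition, for a numeric outcome caret stratifies the folds on quantile groups of y, which can shift fold sizes by a row or two. Each training set is 185 minus one held-out fold, giving sizes around 165 to 168. A sketch with createFolds (exact sizes will vary from run to run):

    ```r
    library(caret)
    # Split 185 rows into 10 folds; createFolds returns the
    # held-out row indices for each fold.
    folds <- createFolds(rnorm(185), k = 10)
    sapply(folds, length)        # roughly 18-19 rows held out per fold
    185 - sapply(folds, length)  # training-set sizes, roughly 166-167
    ```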