1) I tried a regression Random Forest on a training set of 185 rows with 4 independent variables. Two of the variables are categorical, with 3 and 13 levels respectively; the other two are continuous numeric variables.
I ran RF with 10-fold cross-validation repeated 4 times. (I didn't scale the dependent variable, which is why the RMSE is so big.)
I guess the reason mtry is bigger than 4 is that the categorical variables have 3 + 13 = 16 levels in total. But if so, why does it not also count the numeric variables?
185 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times)
Summary of sample sizes: 168, 165, 166, 167, 166, 167, ...
Resampling results across tuning parameters:
  mtry  RMSE      Rsquared   MAE
   2    16764183  0.7843863  9267902
   9     9451598  0.8615202  3977457
  16     9639984  0.8586409  3813891
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 9.
Please help me understand mtry.
2) Also, why do the fold sample sizes change? They are 168, 165, 166, 167, 166, 167, ... rather than all equal.
Thank you so much.
You are correct that there are 16 variables to sample from, hence the maximum for mtry is 16. Note that this is not because the factor levels sum to 16, though: the formula interface of train() passes the predictors through model.matrix(), which expands each factor into dummy variables. That gives (3 - 1) + (13 - 1) = 14 dummy columns plus your 2 numeric columns, i.e. 16 predictors in total, so the numeric variables are included in the count.
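You can check the expansion yourself (a minimal sketch with made-up data shaped like yours; the data frame d and its column names are just for illustration):

# a 3-level factor, a 13-level factor, and two numeric predictors
d <- data.frame(f3  = factor(sample(letters[1:3],  185, replace = TRUE), levels = letters[1:3]),
                f13 = factor(sample(letters[1:13], 185, replace = TRUE), levels = letters[1:13]),
                n1  = rnorm(185),
                n2  = runif(185))
# model.matrix() is what train()'s formula interface uses;
# drop the intercept column to count the actual predictors
ncol(model.matrix(~ ., data = d)) - 1
# 16 = (3 - 1) + (13 - 1) + 2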
The values caret chooses depend on two things. In train() there is a tuneLength argument, which defaults to 3:
tuneLength = ifelse(trControl$method == "none", 1, 3)
This means it tests three values of the tuning parameter. For randomForest the only tuning parameter is mtry, and its default grid function is:
caret::getModelInfo("rf")[[1]]$grid
#> function (x, y, len = NULL, search = "grid")
#> {
#>     if (search == "grid") {
#>         out <- data.frame(mtry = caret::var_seq(p = ncol(x),
#>             classification = is.factor(y), len = len))
#>     }
#>     else {
#>         out <- data.frame(mtry = unique(sample(1:ncol(x), size = len,
#>             replace = TRUE)))
#>     }
#>     out
#> }
Created on 2022-07-01 by the reprex package (v2.0.1)
Since you have 16 columns, the grid becomes:
var_seq(16, len = 3)
[1]  2  9 16
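If you raise tuneLength, var_seq() simply spaces more values between 2 and the number of columns (for a p of this size it uses floor(seq(2, p, length = len))), so, for example, tuneLength = 5 should give:
var_seq(16, len = 5)
[1]  2  5  9 12 16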
You can also test mtry values of your own choice by supplying a tuning grid:
library(caret)

trCtrl <- trainControl(method = "repeatedcv", repeats = 4, number = 10)
# try mtry = 2, 4, 6, ..., 16
trg <- data.frame(mtry = seq(2, 16, by = 2))
# some random data shaped like your problem: a 13-level factor,
# a 3-level factor, and two numeric predictors
df <- data.frame(y  = rnorm(200),
                 x1 = sample(letters[1:13], 200, replace = TRUE),
                 x2 = sample(LETTERS[1:3], 200, replace = TRUE),
                 x3 = rpois(200, 10),
                 x4 = runif(200))
# fit; train() defaults to method = "rf"
mdl <- train(y ~ ., data = df, tuneGrid = trg, trControl = trCtrl)
Random Forest
200 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 4 times)
Summary of sample sizes: 180, 180, 180, 180, 180, 180, ...
Resampling results across tuning parameters:
  mtry  RMSE      Rsquared    MAE
   2    1.120216  0.04448700  0.8978851
   4    1.157185  0.04424401  0.9275939
   6    1.172316  0.04902991  0.9371778
   8    1.186861  0.05276752  0.9485516
  10    1.193595  0.05490291  0.9543479
  12    1.200837  0.05608624  0.9574420
  14    1.205663  0.05374614  0.9621094
  16    1.210783  0.05537412  0.9665665
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
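As for your second question: the training-set sizes vary because 185 rows cannot be split into 10 equal folds, and for a numeric outcome caret additionally stratifies the folds on quantiles of y, so the held-out folds differ slightly in size. Each resample trains on 185 minus one fold, which is where the 165-168 come from. A small sketch to see this yourself (random y, purely for illustration):

set.seed(1)
folds <- createFolds(rnorm(185), k = 10, returnTrain = TRUE)
# sizes of the 10 training splits: around 166-167, not all identical
lengths(folds)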