I am trying to tune the hyperparameters of a random forest regression model, but every accuracy measure comes out exactly the same no matter how I change the hyperparameters. I've reproduced the problem with the built-in "diamonds" dataset. Here is my code:
library(caret)     # train(), trainControl(), createFolds()
library(ranger)    # backend for method = "ranger"
library(ggplot2)   # diamonds dataset

train <- diamonds[, c(1, 5, 8:10)]

# assign each row a fold label from 1 to 6
x <- 1:6
folds <- sample(x, size = nrow(diamonds), replace = TRUE)

rf_grid <- expand.grid(.mtry = 2:4,
                       .splitrule = "variance",
                       .min.node.size = 20)

set.seed(105)
model <- train(train[, 2:5],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds,
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)

results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)

saveRDS(model, "sample_model.rds")
write.csv(results1, "sample_model.csv", row.names = FALSE)
Here's what I get for results: the RMSE, R-squared, and MAE values are exactly the same for every value of mtry. What the heck?
UPDATE: I reduced the sample size to 1000 for faster processing and got different numbers, but they are still all identical to each other. Code:
train <- diamonds[, c(1, 5, 8:10)]
train <- train[1:1000, ]

# assign each row a fold label from 1 to 6
x <- 1:6
folds <- sample(x, size = nrow(train), replace = TRUE)

rf_grid <- expand.grid(.mtry = 2:4,
                       .splitrule = "variance",
                       .min.node.size = 20)

set.seed(105)
model <- train(train[, 2:5],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv",
                                        index = folds,
                                        search = "random"),
               num.trees = 10,
               tuneLength = 10)

results1 <- as.data.frame(model$results)
results1$ntree <- 10
results1$sample.size <- nrow(train)

saveRDS(model, "sample_model2.rds")
write.csv(results1, "sample_model2.csv", row.names = FALSE)
Results: different values this time, but again identical across every row.
This seems to be an issue with your cross-validation folds. When I run your code and print model, it says:
Summary of sample sizes: 1, 1, 1, 1, 1, 1, ...
indicating that each fold only has a sample size of 1.
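The reason is that the index argument of trainControl expects a list with one element per resample, where each element holds the row indices of that resample's training set. Your folds object is a plain integer vector of fold labels, so caret ends up treating each element as a training set of a single row. A quick sketch of the difference in structure (printed output elided):

# what the question builds: an integer vector of fold labels, one per row
str(sample(1:6, size = 10, replace = TRUE))
#  int [1:10] ...

# what index expects: a list of training-row indices, one element per resample
str(createFolds(1:10, k = 2, returnTrain = TRUE))
# List of 2
#  $ Fold1: int [1:5] ...
#  $ Fold2: int [1:5] ...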
I think if you define folds like this instead, it will work more like you're expecting it to:
folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)
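For completeness, here is the corrected call as a self-contained sketch. Apart from how folds is built, it is your original code; I also dropped search = "random" and tuneLength, since a fixed tuneGrid already determines the candidate parameter combinations:

library(caret)
library(ranger)    # backend for method = "ranger"
library(ggplot2)   # diamonds dataset

train <- diamonds[1:1000, c(1, 5, 8:10)]

rf_grid <- expand.grid(.mtry = 2:4,
                       .splitrule = "variance",
                       .min.node.size = 20)

# six resamples; each element holds the TRAINING row indices for one fold
folds <- createFolds(train$carat, k = 6, returnTrain = TRUE)

set.seed(105)
model <- train(train[, 2:5],
               train$carat,
               method = "ranger",
               importance = "impurity",
               metric = "RMSE",
               tuneGrid = rf_grid,
               trControl = trainControl(method = "cv", index = folds),
               num.trees = 10)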
The results then look like this:
Random Forest
1000 samples
4 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 832, 833, 835, 834, 834, 832, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 0.01582362 0.9933839 0.00985451
3 0.01601980 0.9932625 0.00994588
4 0.01567161 0.9935624 0.01018242
Tuning parameter 'splitrule' was held constant at a value of variance
Tuning parameter 'min.node.size' was held constant at a value of 20
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 4, splitrule = variance and min.node.size = 20.
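If you want to confirm the folds were actually picked up, the fitted train object keeps them; a quick check (output will vary):

# each resample should now contain roughly 5/6 of the 1000 rows, not 1
lengths(model$control$index)

# per-fold hold-out performance for the final model
model$resample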