I have been using a gbm
in the caret
package in Rstudio
to find the probability for the occurrence of a failure.
I have used Youden's J to find a threshold for the best classification, which is 0.63. How do I now use this threshold? I presume the best way to do this is to somehow incorporated the threshold into the gbm
model in caret
to get more accurate predictions, and then rerun the model on the training data again? Currently it defaults to 0.5 and I can't find an obvious way to update the threshold.
Alternatively, is the threshold just used to separate the test data predictions into the correct class? This seems more straight forward, but how then do I reflect the change in the ROC_AUC plot, assuming the probability should be updated based on the new threshold?
Any help would be gratefully received. Thanks
EDIT: The full code I am working on is as follows:
library(datasets)
library(caret)
library(MLeval)
library(dplyr)
data(iris)
data <- as.data.frame(iris)
# create class
data$class <- ifelse(data$Species == "setosa", "yes", "no")
# split into train and test
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)
# Set up control function for training
ctrl <- trainControl(method = "cv",
number = 5,
returnResamp = 'none',
summaryFunction = twoClassSummary,
classProbs = T,
savePredictions = T,
verboseIter = F)
# Set up trainng grid - this is based on a hyper-parameter tune that was recently done
gbmGrid <- expand.grid(interaction.depth = 10,
n.trees = 20000,
shrinkage = 0.01,
n.minobsinnode = 4)
# Build a standard classifier using a gradient boosted machine
set.seed(5627)
gbm_iris <- train(class ~ .,
data = train,
method = "gbm",
metric = "ROC",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
# Calcuate best thresholds
caret::thresholder(gbm_iris, threshold = seq(.01,0.99, by = 0.01), final = TRUE, statistics = "all")
pred <- predict(gbm_iris, newdata = test, type = "prob")
roc <- evalm(data.frame(pred, test$class))
There are several problems in your code. I will use the PimaIndiansDiabetes
data set from mlbench
since it is better suited then the iris
data set.
First of all for splitting data into train and test sets the code:
train <- data %>% sample_frac(.70)
test <- data %>% sample_frac(.30)
is not suited since some rows occurring in the train set will also occur in the test set.
Additionally avoid to use function names as object names, it will save you much headache in the long run.
data(iris)
data <- as.data.frame(iris) #bad object name
To the example:
library(caret)
library(ModelMetrics)
library(dplyr)
library(mlbench)
data(PimaIndiansDiabetes, package = "mlbench")
Create train and test sets, you may use base R sample
to sample rows or caret::createDataPartition
. createDataPartition
is preferable since it tries to preserve the distribution of the response.
set.seed(123)
ind <- createDataPartition(PimaIndiansDiabetes$diabetes, 0.7)
tr <- PimaIndiansDiabetes[ind$Resample1,]
ts <- PimaIndiansDiabetes[-ind$Resample1,]
This way no rows in the train set will be in the test set.
Lets create the model:
ctrl <- trainControl(method = "cv",
number = 5,
returnResamp = 'none',
summaryFunction = twoClassSummary,
classProbs = T,
savePredictions = T,
verboseIter = F)
gbmGrid <- expand.grid(interaction.depth = 10,
n.trees = 200,
shrinkage = 0.01,
n.minobsinnode = 4)
set.seed(5627)
gbm_pima <- train(diabetes ~ .,
data = tr,
method = "gbm", #use xgboost
metric = "ROC",
tuneGrid = gbmGrid,
verbose = FALSE,
trControl = ctrl)
create a vector of probabilities for thresholder
probs <- seq(.1, 0.9, by = 0.02)
ths <- thresholder(gbm_pima,
threshold = probs,
final = TRUE,
statistics = "all")
head(ths)
Sensitivity Specificity Pos Pred Value Neg Pred Value Precision Recall F1 Prevalence Detection Rate Detection Prevalence
1 200 10 0.01 4 0.10 1.000 0.02222222 0.6562315 1.0000000 0.6562315 1.000 0.7924209 0.6510595 0.6510595 0.9922078
2 200 10 0.01 4 0.12 1.000 0.05213675 0.6633439 1.0000000 0.6633439 1.000 0.7975413 0.6510595 0.6510595 0.9817840
3 200 10 0.01 4 0.14 0.992 0.05954416 0.6633932 0.8666667 0.6633932 0.992 0.7949393 0.6510595 0.6458647 0.9739918
4 200 10 0.01 4 0.16 0.984 0.07435897 0.6654277 0.7936508 0.6654277 0.984 0.7936383 0.6510595 0.6406699 0.9636022
5 200 10 0.01 4 0.18 0.984 0.14188034 0.6821550 0.8750000 0.6821550 0.984 0.8053941 0.6510595 0.6406699 0.9401230
6 200 10 0.01 4 0.20 0.980 0.17179487 0.6886786 0.8833333 0.6886786 0.980 0.8086204 0.6510595 0.6380725 0.9271018
Balanced Accuracy Accuracy Kappa J Dist
1 0.5111111 0.6588517 0.02833828 0.02222222 0.9777778
2 0.5260684 0.6692755 0.06586592 0.05213675 0.9478632
3 0.5257721 0.6666781 0.06435166 0.05154416 0.9406357
4 0.5291795 0.6666781 0.07134190 0.05835897 0.9260250
5 0.5629402 0.6901572 0.15350721 0.12588034 0.8585308
6 0.5758974 0.6979836 0.18460584 0.15179487 0.8288729
extract the threshold probability based on your preferred metric
ths %>%
mutate(prob = probs) %>%
filter(J == max(J)) %>%
pull(prob) -> thresh_prob
thresh_prob
0.74
predict on test data
pred <- predict(gbm_pima, newdata = ts, type = "prob")
create a numeric response (0 or 1) based on the response in the test set since this is needed for the functions from package ModelMetrics
real <- as.numeric(factor(ts$diabetes))-1
ModelMetrics::sensitivity(real, pred$pos, cutoff = thresh_prob)
0.2238806 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::specificity(real, pred$pos, cutoff = thresh_prob)
0.956
ModelMetrics::kappa(real, pred$pos, cutoff = thresh_prob)
0.2144026 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::mcc(real, pred$pos, cutoff = thresh_prob)
0.2776309 #based on this it is clear the threshold chosen is not optimal on this test data
ModelMetrics::auc(real, pred$pos)
0.8047463 #decent AUC and low mcc and kappa indicate a poor choice of threshold
Auc is a measure over all thresholds so it does not require specification of the cutoff threshold.
Since only one train/test split is used the performance evaluation will be biased. Best is to use nested resampling so the same can be evaluated over several train/test splits. Here is a way to performed nested resampling.
EDIT: Answer to the questions in comments.
To create the roc curve you do not need to calculate sensitivity and specificity on all thresholds you can just use a specified package for such a task. The results are probability going to be more trustworthy.
I prefer using the pROC package:
library(pROC)
roc.obj <- roc(real, pred$pos)
plot(roc.obj, print.thres = "best")
The best threshold on the figure is the threshold that gives the highest specificity + sensitivity on the test data. It is clear that this threshold (0.289) is much lower compared to the threshold obtained based on cross validated predictions (0.74). This is the reason I said there will be considerable optimistic bias if you adjust the threshold on the cross-validated predictions and use thus obtained performance as an indicator of threshold success.
In the above example not tuning the threshold would have resulted in better performance on the test set. This might hold true in general for the Pima Indians data set or this might be a case of an unfortunate train/test split. So it is best to validate this sort of thing using nested resampling.