
How to make calibration plots from predictions of a binary outcome?


I have made several models (RF, XGB and GLM) to predict a binary outcome, and they all achieved an AUC of approximately 0.8 and a Brier score of 0.1-0.15.

I am trying to create calibration plots and I am getting results that I don't understand.

My questions, as they pertain to programming:

  1. Is the below syntax correct to obtain a calibration plot for my model?
  2. Can anyone with experience in this field see from the plot whether it is just a poor model or obviously faulty programming?
  3. How would experienced users create a calibration plot for a prediction model for a binary categorical outcome?

I don't know whether I am feeding the wrong type of data to the plotting functions, whether the models are simply poorly calibrated, or whether the outcome is too rare / the data too small. The test set is ~400 subjects, of whom ~40 have the outcome.
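
To get a sense of how thinly ~40 events spread across the probability bins, I imagine a rough tally along these lines would work (a sketch only; pred_df, pred and outcome_num are placeholder names for a data frame holding the predicted probabilities and a 0/1 version of the outcome):

library(dplyr)

# count subjects and events in each 10%-wide predicted-probability bin
pred_df %>%
  mutate(bin = cut(pred, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(n = n(), events = sum(outcome_num), .groups = "drop")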

For the RF model, this is the syntax I have used:

library(randomForest)

RF_model <- randomForest(outcome ~ ., data = TRAIN_data)

# add the predicted probability of the "no" class to the test-set results
RF_prediction$pred <- predict(RF_model, TEST_data, type = "prob")[, "no"]

and for the calibration plot (with the "probably" package):

library(probably)   # the pipe (%>%) comes from dplyr/magrittr

RF_prediction %>% cal_plot_breaks(outcome, pred)

As shown in the picture, the plot takes a hard dive after the midpoint and I can't wrap my head around it (I am not a mathematician or data scientist, but an MD doing some research). My guess is that something is labelled wrong or that the predicted output is in the wrong format somehow?

[Figure: calibration plot]

The outcome is a factor, but I can only make the plots work when it is an integer (0s and 1s!). The predictions are numbers (0.020, 0.004, 0.000, 0.008, 0.000, 0.026, 0.044, 0.002, 0.002, 0.002, 0.050, 0.002, 0.018, 0.048, etc.).
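
For illustration, here is a minimal self-contained version of what I am trying, with simulated data standing in for my real test set (I am assuming probably treats the first factor level as the event by default; that assumption may be part of my confusion):

library(probably)

set.seed(42)

# simulated stand-in for my test set: a two-level factor outcome plus the
# predicted probability of the first level ("yes") as a plain numeric column
toy <- data.frame(
  outcome = factor(sample(c("yes", "no"), 400, replace = TRUE, prob = c(0.1, 0.9)),
                   levels = c("yes", "no")),
  pred    = runif(400, 0, 0.6)
)

cal_plot_breaks(toy, outcome, pred)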

I have tried several other packages but end up with variations on very similar-looking plots regardless of package. I have tried predtools, caret, classifierplots and runway, so as you can imagine I need some help at this point!

Any help would be greatly appreciated!


Solution

  • Your issue just seems to be small samples in the validation set. You can see that the confidence interval for your last bin (50-60%) runs from a 0% to a ~60% event rate, so your results aren't out of the ordinary, just highly uncertain, probably because of the low volume of data in that bin (see the sketch below).
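
If you want the plot itself to be less jumpy, one option (a sketch, assuming your RF_prediction data frame with outcome and pred columns, and that your version of probably has the num_breaks argument and cal_plot_logistic()) is to use fewer, wider bins or a smoothed curve:

library(dplyr)
library(probably)

# fewer, wider bins put more subjects (and events) behind each point estimate
RF_prediction %>% cal_plot_breaks(outcome, pred, num_breaks = 5)

# a smoothed (logistic) calibration curve avoids hard binning altogether
RF_prediction %>% cal_plot_logistic(outcome, pred)

Either way, with only ~40 events the right-hand end of the curve rests on a handful of subjects, so wide intervals there are expected.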