resampling caret statistics-bootstrap imbalanced-data

Creating balanced bootstrap resamples in caret


I'm using caret to compare models for a classification problem with nested CV: V-fold in the outer loop and bootstrap (500 replicates) in the inner loop. I get this warning after training knn:

Warning: There were missing values in resampled performance measures.

I believe this happens because some resamples have zero items of the class of interest in the holdout sample, yielding NA for Sensitivity and ROC. My question is: is there any way to ensure that items from this class are present in every bootstrap resample? Something like what the createDataPartition function does (I believe this is also called stratified bootstrap?).

If not, how should we proceed with this? (In terms of comparing model performance on the same resamples)

Thanks!


Solution

  • I couldn't find a way to do this within caret, but here is a workaround using the rsample package. The idea is to compute the stratified resamples beforehand, convert them to caret's format, and pass them to the trainControl function via the index and indexOut arguments.

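    library(rsample); library(caret)
    # Stratified bootstrap on the outcome column, then convert to caret's index lists
    indices <- bootstraps(train, times = 50, strata = "class_of_interest")
    indices <- rsample2caret(indices)
    # Pass the precomputed in-bag (index) and holdout (indexOut) rows to caret
    train_control <- trainControl(method = "boot", number = 50, index = indices$index, indexOut = indices$indexOut)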
    

    Hope this helps.
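
To make the idea concrete, here is a minimal base-R sketch of what a stratified bootstrap does and how to check the holdouts afterwards. It is not how rsample or caret implement it: `stratified_boot` is a hypothetical helper, and the outcome vector `y` is made-up toy data. It resamples rows with replacement inside each class, treats the rows never drawn as the holdout, and returns lists shaped like the index and indexOut arguments above.

```r
set.seed(42)

# Toy imbalanced outcome: 95 "no" and 5 "yes" (made-up data for illustration).
y <- factor(c(rep("no", 95), rep("yes", 5)))

# Hypothetical helper, not part of caret or rsample.
stratified_boot <- function(y, times) {
  rows_by_class <- split(seq_along(y), y)
  index <- lapply(seq_len(times), function(i) {
    # Resample with replacement inside each class, so class counts stay fixed.
    # sample.int() avoids the sample() surprise when a class has a single row.
    sort(unlist(lapply(rows_by_class, function(rows) {
      rows[sample.int(length(rows), replace = TRUE)]
    }), use.names = FALSE))
  })
  names(index) <- sprintf("Bootstrap%03d", seq_len(times))
  # Holdout = rows never drawn in that resample (out-of-bag).
  index_out <- lapply(index, function(idx) setdiff(seq_along(y), idx))
  list(index = index, indexOut = index_out)
}

res <- stratified_boot(y, times = 50)

# Count minority-class rows in each holdout; a zero marks a resample
# that would still give NA for Sensitivity and ROC.
n_yes_out <- vapply(res$indexOut, function(out) sum(y[out] == "yes"), integer(1))
table(n_yes_out == 0)
```

Note that stratifying fixes the class counts in the in-bag sample; it makes an empty holdout for the minority class unlikely but does not strictly rule it out when that class has very few rows, so a check like the last two lines is worth running on the real indices.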