When using a classifier like GaussianNB(), the resulting .predict_proba()
values are sometimes poorly calibrated; that's why I'd like to wrap the classifier in sklearn's CalibratedClassifierCV.
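Just to be explicit about what I mean by wrapping, here is a minimal sketch on balanced toy data (the names and numbers are only illustrative):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X_toy, y_toy = make_classification(n_samples=200, random_state=0)
calibrated = CalibratedClassifierCV(base_estimator=GaussianNB(), method="isotonic", cv=5)
calibrated.fit(X_toy, y_toy)                # internal 5-fold CV fits and calibrates GaussianNB
proba = calibrated.predict_proba(X_toy)     # calibrated probabilities, shape (200, 2)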
I now have a binary classification problem with only very few positive samples, so few that CalibratedClassifierCV fails because there are fewer samples than folds (the resulting error is Requesting 5-fold cross-validation but provided less than 5 examples for at least one class.). Thus, I'd like to upsample the minority class before applying the classifier. I use imblearn's pipeline for this, as it ensures that resampling takes place only during fit and not during inference.
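To illustrate that point, a minimal, self-contained sketch (toy data and names are only illustrative, not my real setup): the sampler acts only inside fit, so predict_proba sees the original samples unchanged.

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X_demo, y_demo = make_classification(n_samples=100, weights=(0.9,), random_state=0)
demo = Pipeline([
    ("resampling", RandomOverSampler(random_state=0)),  # applied during fit only
    ("model", GaussianNB())
])
demo.fit(X_demo, y_demo)            # GaussianNB is trained on the oversampled data
proba = demo.predict_proba(X_demo)  # no resampling here; output shape is (100, 2)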
However, I cannot find a way to upsample my training data and combine it with CalibratedClassifierCV while ensuring that upsampling takes place only during fit and not during inference.
I tried the following reproducible example, but it seems that CalibratedClassifierCV wants to split the data first, prior to upsampling, and fails.
Is there a way to correctly upsample data while using CalibratedClassifierCV?
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
X, y = make_classification(
    n_samples = 100,
    n_features = 10,
    n_classes = 2,
    weights = (0.95,),  # 5% of samples are of class 1
    random_state = 10,
    shuffle = True
)
X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size = 0.2,
    random_state = 10,
    shuffle = True,
    stratify = y
)
pipeline = Pipeline([
    ("resampling", RandomOverSampler(
        sampling_strategy=0.2,
        random_state=10
    )),
    ("model", GaussianNB())
])
m = CalibratedClassifierCV(
    base_estimator=pipeline,
    method="isotonic",
    cv=5,
    n_jobs=-1
)
m.fit(X_train, y_train)  # raises: Requesting 5-fold cross-validation but provided less than 5 examples for at least one class.
I think I now understand my conceptual error: the cross-validation split has to happen BEFORE upsampling, not after (otherwise there would be information leakage from validation to training). But if the split happens first, I cannot have more folds than samples of the positive class, so oversampling does not save me from having too few samples for CalibratedClassifierCV. I indeed have to reduce the number of folds, as @NMH1013 suggests.
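For completeness, a minimal sketch of that fix applied to the code above (the lower bound of 2 folds is my own assumption): cap the number of folds at the minority-class count so that every fold contains at least one positive sample.

import numpy as np

n_pos = int(np.sum(y_train == 1))   # positive samples in the training set
cv_folds = max(2, min(5, n_pos))    # never request more folds than positives

m = CalibratedClassifierCV(
    base_estimator=pipeline,
    method="isotonic",
    cv=cv_folds,
    n_jobs=-1
)
m.fit(X_train, y_train)             # no longer raises the fold-count error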