Hi all, I am having trouble understanding how to use the output of sklearn.calibration.CalibratedClassifierCV.
I have calibrated my binary classifier using this method, and the results are greatly improved. However, I am not sure how to interpret them. The sklearn guide states that, after calibration, the output of the predict_proba method can be directly interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that, among the samples to which it gave a predict_proba value close to 0.8, approximately 80% actually belong to the positive class.
Now I would like to reduce false positives by applying a cutoff at .6 for the model to predict the label True. Without the calibration, I would have simply used my_model.predict_proba() > .6.
However, it seems that after calibration the meaning of predict_proba has changed, so I am not sure if I can do that anymore.
From a quick test, it seems that predict and predict_proba still follow the same logic I would expect before calibration. The output of:
pred = my_model.predict(valid_x)
proba = my_model.predict_proba(valid_x)
pd.DataFrame({"label": pred, "proba": proba[:,1]})
shows that everything with a probability above .5 is classified as True, and everything below .5 as False.
Can you confirm that, after calibration, I can still use predict_proba to apply a different cutoff to identify my labels?
sklearn guide: https://scikit-learn.org/stable/modules/calibration.html#calibration
As far as I can tell, you can indeed use predict_proba() after calibration to apply a different cutoff.
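To make this concrete, here is a minimal sketch of applying a .6 cutoff to a calibrated model. The synthetic data from make_classification, the LogisticRegression base estimator, and the variable names train_x, train_y, proba_pos, and cutoff are assumptions for illustration, not taken from the question; my_model and valid_x simply stand in for the asker's objects.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary problem (an assumption; replace with your own data).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
train_x, valid_x, train_y, valid_y = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Calibrate a base classifier with cross-validation.
my_model = CalibratedClassifierCV(
    LogisticRegression(max_iter=1000), method="sigmoid", cv=5
)
my_model.fit(train_x, train_y)

# Calibrated probability of the positive class (second column).
proba_pos = my_model.predict_proba(valid_x)[:, 1]

# Custom cutoff: only predict True when the calibrated probability exceeds .6.
cutoff = 0.6
pred_custom = proba_pos > cutoff

# Default behaviour of predict() for comparison (effectively a .5 cutoff).
pred_default = my_model.predict(valid_x).astype(bool)

# Count false positives under both rules.
fp_default = int(np.sum(pred_default & (valid_y == 0)))
fp_custom = int(np.sum(pred_custom & (valid_y == 0)))
print(f"false positives at .5: {fp_default}, at {cutoff}: {fp_custom}")
```

Because every sample above .6 is also above .5, the stricter cutoff can only keep or reduce the number of false positives, usually at the cost of some recall.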
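What happens within class CalibratedClassifierCV (as you noticed) is effectively that the output of predict() is based on the output of predict_proba() (see here for reference): predict() returns the class with the highest calibrated probability, i.e. self.classes_[np.argmax(self.predict_proba(X), axis=1)] == self.predict(X), which for 0/1 labels reduces to np.argmax(self.predict_proba(X), axis=1) == self.predict(X).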
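This relationship is easy to check empirically. The snippet below is a small sketch on synthetic data; the GaussianNB base estimator and the names model, from_proba, and labels_match are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

# Synthetic binary data (an assumption for illustration).
X, y = make_classification(n_samples=500, random_state=0)

model = CalibratedClassifierCV(GaussianNB(), cv=3).fit(X, y)

# Rebuild the labels from the calibrated probabilities ...
proba = model.predict_proba(X)
from_proba = model.classes_[np.argmax(proba, axis=1)]

# ... and compare them with what predict() returns.
labels_match = np.array_equal(from_proba, model.predict(X))
print("predict() agrees with argmax of predict_proba():", labels_match)
```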
On the other hand, for the non-calibrated classifier that you're passing to CalibratedClassifierCV, the above equality may or may not hold, depending on whether it is a probabilistic classifier or not (e.g. it does not for an SVC() classifier; see here, for instance, for some other details on this).
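One way to see this for yourself is to count how often the two outputs of a plain SVC disagree. This is only a sketch on synthetic data: the dataset, the names svc and n_disagree, and the choice of probability=True are assumptions, and the number of disagreements depends on the data (it can be zero).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, somewhat noisy binary data (an assumption for illustration).
X, y = make_classification(n_samples=300, flip_y=0.2, random_state=0)

# probability=True makes SVC fit an internal Platt scaling for predict_proba(),
# while predict() still relies on the sign of the decision function.
svc = SVC(probability=True, random_state=0).fit(X, y)

from_proba = svc.classes_[np.argmax(svc.predict_proba(X), axis=1)]
from_predict = svc.predict(X)

# Samples where the two outputs disagree, typically those near the boundary.
n_disagree = int(np.sum(from_proba != from_predict))
print(f"{n_disagree} of {len(y)} samples get different labels")
```

If you threshold probabilities from such a classifier, it is safer to rely on predict_proba() consistently rather than mixing it with predict().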