While building a generic evaluation tool, I ran into the following problem: cross_val_score(...).mean() gives a slightly different result than the score computed from cross_val_predict().
For the testing score I have the following code, which computes the score for each fold and then takes the mean over all folds.
testing_score = cross_val_score(clas_model, algo_features, algo_featurest, cv=folds).mean()
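If it helps to see what that one-liner does, here is a minimal sketch, with synthetic data and LogisticRegression standing in for algo_features, algo_featurest and clas_model, and assuming the default stratified k-fold splitting that scikit-learn uses for classifiers: one accuracy per fold, then an unweighted mean.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=299, random_state=0)  # placeholder data
model = LogisticRegression(max_iter=1000)                  # placeholder model
folds = 5

# Manual equivalent of cross_val_score(...).mean(): score each fold, then average.
fold_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=folds).split(X, y):
    model.fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[test_idx], y[test_idx]))  # accuracy on this fold

print(np.mean(fold_scores))
print(cross_val_score(model, X, y, cv=folds).mean())  # should agree with the manual mean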
For the tp, fp, tn and fn counts I have the following code, which computes these metrics over all folds (the sum, I suppose).
test_clas_predictions = cross_val_predict(clas_model, algo_features, algo_featurest, cv=folds)
test_cm = confusion_matrix(algo_featurest, test_clas_predictions)
test_tp = test_cm[1][1]  # true label 1, predicted 1
test_fp = test_cm[0][1]  # true label 0, predicted 1
test_tn = test_cm[0][0]  # true label 0, predicted 0
test_fn = test_cm[1][0]  # true label 1, predicted 0
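As a quick sanity check of that "sum over folds" intuition, here is a hedged sketch (again with synthetic data and a placeholder model) showing that the confusion matrix built from cross_val_predict's pooled out-of-fold predictions equals the sum of the per-fold confusion matrices:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict

X, y = make_classification(n_samples=299, random_state=0)  # placeholder data
model = LogisticRegression(max_iter=1000)                  # placeholder model
folds = 5

# Confusion matrix from the pooled out-of-fold predictions.
pooled = confusion_matrix(y, cross_val_predict(model, X, y, cv=folds))

# Sum of the per-fold confusion matrices, using the same stratified splits.
summed = np.zeros((2, 2), dtype=int)
for train_idx, test_idx in StratifiedKFold(n_splits=folds).split(X, y):
    model.fit(X[train_idx], y[train_idx])
    summed += confusion_matrix(y[test_idx], model.predict(X[test_idx]), labels=[0, 1])

print(np.array_equal(pooled, summed))  # expected: True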
The outcome of this code is:
                         algo      test  test_tp  test_fp  test_tn  test_fn
5                  GaussianNB  0.719762       25       13      190       71
4          LogisticRegression  0.716429       24       13      190       72
2      DecisionTreeClassifier  0.702381       38       33      170       58
0  GradientBoostingClassifier  0.682619       37       36      167       59
3        KNeighborsClassifier  0.679048       36       36      167       60
1      RandomForestClassifier  0.675952       40       43      160       56
So, picking the first line, cross_val_score.mean() gave 0.719762 (test), while computing the score from the confusion matrix as (tp+tn)/(tp+tn+fp+fn) = (25+190)/(25+13+190+71) = 0.719063545150..., which is slightly different.
I came across this in a Quora article: "In cross_val_predict() elements are grouped slightly different than in cross_val_score(). It means that when you will calculate the same metric using these functions, you can get different results."
Is there any particular reason behind this?
This is also called out in the documentation for cross_val_predict:

Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. Results can differ from cross_validate and cross_val_score unless all tests sets have equal size and the metric decomposes over samples.
It looks like in your case your metric is accuracy, which does decompose over samples. But it is possible (actually likely, because the total size is 299, which is not highly divisible) that your test folds are not all the same size. cross_val_score averages the per-fold accuracies with equal weight per fold, while the accuracy computed from cross_val_predict's pooled predictions weights every sample equally, so the two are only guaranteed to coincide when all test folds have the same size. That explains the very small (relative) difference between the two.
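Here is a small, hedged sketch of that effect with illustrative data and a model (not yours): with 299 samples and 5 folds the test folds cannot all be the same size, so the two numbers typically end up close but not identical.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=299, flip_y=0.3, random_state=0)  # 299 = 4*60 + 59
model = GaussianNB()
folds = 5

# Unweighted mean of the per-fold accuracies (what cross_val_score reports).
mean_of_fold_scores = cross_val_score(model, X, y, cv=folds).mean()

# Accuracy of the pooled out-of-fold predictions (what your confusion-matrix formula gives).
pooled_accuracy = accuracy_score(y, cross_val_predict(model, X, y, cv=folds))

print(mean_of_fold_scores, pooled_accuracy)  # typically close, but not identical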