python, validation, machine-learning, scikit-learn, scoring

Machine learning: cross_val_score vs cross_val_predict


While building a generic evaluation tool, I ran into the following problem: cross_val_score(...).mean() gives slightly different results than computing the same metric from cross_val_predict.

For calculating the testing score I have the following code, which computes the score for each fold and then takes the mean over all folds.

from sklearn.model_selection import cross_val_score

# algo_featurest is the target vector; the default scorer for a classifier is accuracy
testing_score = cross_val_score(clas_model, algo_features, algo_featurest, cv=folds).mean()
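
For context, with an integer cv and no explicit scorer this call is roughly equivalent to the following loop (a minimal sketch, assuming algo_features and algo_featurest are NumPy arrays; StratifiedKFold splitting and accuracy via the classifier's score method are the defaults scikit-learn applies here):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def manual_cv_mean_score(model, X, y, n_splits):
    cv = StratifiedKFold(n_splits=n_splits)
    fold_scores = []
    for train_idx, test_idx in cv.split(X, y):
        fitted = clone(model).fit(X[train_idx], y[train_idx])
        fold_scores.append(fitted.score(X[test_idx], y[test_idx]))  # accuracy on this fold
    return np.mean(fold_scores)  # unweighted mean over folds, regardless of fold sizes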

For calculating tp, fp, tn and fn I have the following code, which computes these counts over all folds combined (the sum across folds, I suppose).

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

test_clas_predictions = cross_val_predict(clas_model, algo_features, algo_featurest, cv=folds)
test_cm = confusion_matrix(algo_featurest, test_clas_predictions)  # rows = true class, columns = predicted class
test_tp = test_cm[1][1]
test_fp = test_cm[0][1]
test_tn = test_cm[0][0]
test_fn = test_cm[1][0]
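
As an aside, the same four counts can be unpacked in one line, since for binary labels confusion_matrix returns [[tn, fp], [fn, tp]]:

test_tn, test_fp, test_fn, test_tp = confusion_matrix(algo_featurest, test_clas_predictions).ravel()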

The outcome of this code is:

                         algo      test  test_tp  test_fp  test_tn  test_fn
5                  GaussianNB  0.719762       25       13      190       71
4          LogisticRegression  0.716429       24       13      190       72
2      DecisionTreeClassifier  0.702381       38       33      170       58
0  GradientBoostingClassifier  0.682619       37       36      167       59
3        KNeighborsClassifier  0.679048       36       36      167       60
1      RandomForestClassifier  0.675952       40       43      160       56

So, picking the first line, cross_val_score.mean() gave 0.719762 (test), while calculating the score as (tp+tn)/(tp+tn+fp+fn) gives (25+190)/(25+13+190+71) = 0.719063545150..., which is slightly different.
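
Equivalently (using the variables from the snippet above), computing accuracy directly on the pooled out-of-fold predictions gives the same ratio as the confusion-matrix counts, and still does not match cross_val_score(...).mean():

from sklearn.metrics import accuracy_score

# equals (tp + tn) / (tp + fp + tn + fn); for the GaussianNB row this is 215/299 = 0.71906...
pooled_accuracy = accuracy_score(algo_featurest, test_clas_predictions)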

I had the chance to read this in a Quora article: "In cross_val_predict() elements are grouped slightly different than in cross_val_score(). It means that when you will calculate the same metric using these functions, you can get different results."

Is there any particular reason behind this?


Solution

  • This is also called out in the documentation for cross_val_predict:

    Passing these predictions into an evaluation metric may not be a valid way to measure generalization performance. Results can differ from cross_validate and cross_val_score unless all tests sets have equal size and the metric decomposes over samples.

    It looks like in your case your metric is accuracy, which does decompose over samples. But it is possible (actually likely, because the total size, 299, is not highly divisible) that your test folds are not all the same size, which would explain the very small (relative) difference between the two: cross_val_score.mean() is an unweighted mean of per-fold accuracies, while the confusion matrix built from cross_val_predict pools all predictions, which effectively weights each fold by its test-set size.
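
    A minimal sketch of this effect on a synthetic dataset of 299 samples (the names X, y and clf below are illustrative, not from the question): the unweighted mean of the per-fold scores reproduces cross_val_score(...).mean(), while weighting each fold's score by its test-set size reproduces the accuracy computed from the pooled cross_val_predict output.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import StratifiedKFold, cross_val_predict, cross_val_score

    X, y = make_classification(n_samples=299, random_state=0)  # 299 samples: folds can't all be equal
    clf = LogisticRegression(max_iter=1000)
    cv = StratifiedKFold(n_splits=5)  # deterministic splits, reused below

    scores = cross_val_score(clf, X, y, cv=cv)              # one accuracy per fold
    pooled = cross_val_predict(clf, X, y, cv=cv)            # out-of-fold predictions, pooled
    fold_sizes = [len(test) for _, test in cv.split(X, y)]  # e.g. [60, 60, 60, 60, 59]

    print(scores.mean())                           # unweighted mean over folds
    print(accuracy_score(y, pooled))               # accuracy of the pooled predictions
    print(np.average(scores, weights=fold_sizes))  # matches the pooled accuracy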