I created a Python for loop to split the training dataset into stratified K-folds and trained a classifier inside the loop, then used the trained model to predict on the validation fold. The metrics achieved using this process were quite different from those achieved with the cross_val_score function. I expected the same results from both methods.
This code is for text classification, and I use TF-IDF to vectorize the text.
Code for manual implementation of cross validation:
#Importing the libraries and metrics functions needed to measure model performance
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

data_validation = []  # list used to store the results of model validation using cross validation
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
accuracy_val = []
f1_val = []

# use ravel to flatten the multi-dimensional arrays to a single dimension before fold indexing
for train_index, val_index in skf.split(X_train, y_train):
    X_tr, X_val = X_train.ravel()[train_index], X_train.ravel()[val_index]
    y_tr, y_val = y_train.ravel()[train_index], y_train.ravel()[val_index]
    tfidf = TfidfVectorizer()
    X_tr_vec_tfidf = tfidf.fit_transform(X_tr)  # vectorize the training folds
    X_val_vec_tfidf = tfidf.transform(X_val)    # vectorize the validation fold
    # instantiate the model
    model = MultinomialNB(alpha=0.5, fit_prior=False)
    # train the model on the training folds
    model.fit(X_tr_vec_tfidf, y_tr)
    predictions_val = model.predict(X_val_vec_tfidf)  # make predictions on the validation fold
    acc_val = accuracy_score(y_val, predictions_val)
    accuracy_val.append(acc_val)
    f_val = f1_score(y_val, predictions_val)
    f1_val.append(f_val)

avg_accuracy_val = np.mean(accuracy_val)
avg_f1_val = np.mean(f1_val)

# temp list to store the metrics
temp = ['NaiveBayes']
temp.append(avg_accuracy_val)  # validation accuracy score
temp.append(avg_f1_val)        # validation F1 score
data_validation.append(temp)

# Create a table, using a DataFrame, which contains the metrics for the trained and tested ML model
result = pd.DataFrame(data_validation, columns=['Algorithm', 'Accuracy Score : Validation', 'F1-Score : Validation'])
result.reset_index(drop=True, inplace=True)
result
Output:
    Algorithm  Accuracy Score : Validation  F1-Score : Validation
0  NaiveBayes                      0.77012               0.733994
Now the code using the cross_val_score function:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

scores = ['accuracy', 'f1']

#Text vectorization of the training dataset using the NLP technique TF-IDF
tfidf = TfidfVectorizer()
X_tr_vec_tfidf = tfidf.fit_transform(X_train)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
nb = MultinomialNB(alpha=0.5, fit_prior=False)
for score in scores:
    print(f'{score}: {cross_val_score(nb, X_tr_vec_tfidf, y_train, cv=skf, scoring=score).mean()}')
Output:
accuracy: 0.7341283583255231
f1: 0.7062017090972422
As can be seen, the accuracy and F1 metrics are quite different between the two methods. The difference is much worse when I use KNeighborsClassifier.
TL;DR: The two calculations are not equivalent due to the different ways you handle the TF-IDF transformation; the first calculation is the correct one.
In the first calculation you correctly apply fit_transform only to the training data of each fold, and transform to the validation fold:
X_tr_vec_tfidf = tfidf.fit_transform(X_tr) # vectorize the training folds
X_val_vec_tfidf = tfidf.transform(X_val) # vectorize the validation fold
But in the second calculation you do not do that; instead, you apply fit_transform to the whole of the training data, before it is split into training and validation folds:
X_tr_vec_tfidf = tfidf.fit_transform(X_train)
hence the difference. The fact that you seem to get better accuracy with the second, incorrect calculation is due to information leakage: your validation data are not actually unseen, since they have participated in the TF-IDF transformation.
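To see the leakage concretely, here is a minimal sketch with a hypothetical toy corpus (the documents and tokens are illustrative only, not from your data). A vectorizer fitted on all of the data ends up with a vocabulary, and hence IDF weights, that include terms seen only in the validation fold:

from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus: pretend the last document belongs to the validation fold
docs = ["spam offer now", "meeting at noon", "free spam offer"]
train_docs, val_docs = docs[:2], docs[2:]

leaky = TfidfVectorizer().fit(docs)        # fitted on training + validation data
clean = TfidfVectorizer().fit(train_docs)  # fitted on the training fold only

print(sorted(leaky.vocabulary_))  # contains 'free', a term seen only in the validation fold
print(sorted(clean.vocabulary_))  # 'free' is absent: the validation fold stays truly unseen

In the leaky setup, the model effectively gets a preview of the validation vocabulary and document frequencies, which is exactly what inflates the cross_val_score metrics above.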
The correct way to use cross_val_score when we have transformations is via a pipeline (API, User's Guide):
from sklearn.pipeline import Pipeline

tfidf = TfidfVectorizer()
nb = MultinomialNB(alpha=0.5, fit_prior=False)
pipeline = Pipeline([('transformer', tfidf), ('estimator', nb)])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=skf)
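Because the pipeline re-fits the vectorizer inside each fold, this now matches the logic of your manual loop. If you want both accuracy and F1 in one call, cross_validate accepts a list of scorers; a sketch, assuming the same pipeline, skf, X_train, and y_train as above:

from sklearn.model_selection import cross_validate

cv_results = cross_validate(pipeline, X_train, y_train, cv=skf, scoring=['accuracy', 'f1'])
print('accuracy:', cv_results['test_accuracy'].mean())  # mean validation accuracy across folds
print('f1:', cv_results['test_f1'].mean())              # mean validation F1 across folds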