
K-fold to train a machine learning model

I have a question that I can't find a clear answer to. I use stratified K-fold cross-validation in my grid search (to find good hyperparameters for my ML model). Can I take the best fold to train my final model? For example, here is my code with logistic regression:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score
from sklearn.model_selection import StratifiedKFold

di = {}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_score = []
# enumerate supplies the fold number i, starting at 1
for i, (train_index, test_index) in enumerate(kf.split(X, y), start=1):
    print('{} of KFold {}'.format(i, kf.n_splits))
    # kf.split yields positional indices, so index with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # keep each fold's training data so that one fold can be picked afterwards
    di[i] = {'X_train': X_train, 'y_train': y_train}
    lr = LogisticRegression(C=0.009, penalty='l2', solver='newton-cg').fit(X_train, y_train)
    pred_test = lr.predict_proba(X_test)[:, 1]
    score = roc_auc_score(y_test, pred_test)
    cv_score.append(score)
    precision, recall, thresholds = precision_recall_curve(y_test, pred_test)
    # Use the auc function to calculate the area under the precision-recall curve
    auc_precision_recall = auc(recall, precision)
    print('ROC AUC score :', score)
    print('auc_precision_recall :', auc_precision_recall)

# In this case fold 4 has the best results, so I take it to train my final model:
o = 4

# fitting of the final model on the training data of fold o
lr = LogisticRegression(C=0.009, penalty='l2', solver='newton-cg').fit(di[o]['X_train'], di[o]['y_train'])

I don't know whether doing this is "legal" or totally forbidden.

Thank you for enlightening me!


  • No, you cannot do that at all :) It is like choosing your dataset to get a better final score. Instead, take the scores of all the folds and average them (and calculate the standard deviation), or use cross_val_score. There are multiple answers and articles about this, like here or here.
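
As a rough illustration of the comment above, here is a minimal sketch of averaging the fold scores with cross_val_score and then fitting the final model on all of the data instead of on one fold. The synthetic X and y from make_classification are only a stand-in for the asker's data, and the names scores and final_model are made up for this example. The l2 penalty is left at its scikit-learn default rather than passed explicitly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the asker's X and y (an assumption).
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.8, 0.2], random_state=42)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Same C and solver as in the question; l2 is the default penalty.
model = LogisticRegression(C=0.009, solver='newton-cg')

# One ROC AUC value per fold; cross_val_score fits a fresh clone per fold.
scores = cross_val_score(model, X, y, cv=kf, scoring='roc_auc')
print('ROC AUC per fold:', np.round(scores, 3))
print('mean = {:.3f}, std = {:.3f}'.format(scores.mean(), scores.std()))

# The folds only estimate performance. The final model uses all the data.
final_model = model.fit(X, y)
```

The mean and standard deviation describe how the chosen hyperparameters are expected to perform. No single fold is treated as special, because a fold with a high score may simply have drawn an easier test split.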