python machine-learning training-data grid-search k-fold

K-fold to train a machine learning model


I have a question that I can't figure out the answer to. I use stratified K-fold cross-validation in my grid search (to find good hyperparameters for my ML model). Can I take the best fold to train my final model? For example, here is my code with logistic regression:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score
from sklearn.model_selection import StratifiedKFold

di = {}  # train/test split of each fold, keyed by fold number
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pred_test_full = np.zeros(len(y))  # out-of-fold predicted probabilities
cv_score = []
for i, (train_index, test_index) in enumerate(kf.split(X, y), start=1):
    print('{} of KFold {}'.format(i, kf.n_splits))
    # kf.split yields positional indices, so index with .iloc rather than .loc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

    di[i] = {'X_train': X_train,
             'X_test': X_test,
             'y_train': y_train,
             'y_test': y_test}
    # model
    lr = LogisticRegression(C=0.009, penalty='l2', solver='newton-cg').fit(X_train, y_train)
    pred_test = lr.predict_proba(X_test)[:, 1]
    score = roc_auc_score(y_test, pred_test)
    precision, recall, thresholds = precision_recall_curve(y_test, pred_test)
    # Area under the precision-recall curve
    auc_precision_recall = auc(recall, precision)
    print('ROC AUC score :', score)
    print('auc_precision_recall :', auc_precision_recall)
    cv_score.append(score)
    # Store each fold's predictions at the positions of its test rows
    pred_test_full[test_index] = pred_test

# In this case fold 4 has the best results, so I use it to train my final model:
o = 4
X_train = di[o]['X_train']
X_test = di[o]['X_test']
y_train = di[o]['y_train']
y_test = di[o]['y_test']

# fitting the final model
lr = LogisticRegression(C=0.009, penalty='l2', solver='newton-cg').fit(X_train, y_train)

I don't know whether doing this is "legal" or totally forbidden.

Thank you for enlightening me!


Solution

  • No, you cannot do that at all :) It amounts to picking the subset of your data that happens to give the best final score. Instead, take the scores of all the folds and report their average (and standard deviation), or use cross_val_score, which returns one score per fold. There are multiple answers and articles about this, like here or here.
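
As a rough illustration of the averaging approach, here is a minimal sketch using scikit-learn's cross_val_score. The synthetic data from make_classification is only a stand-in for the X and y in the question, and the name final_model is made up for this example. The hyperparameters are the ones from the question (penalty='l2' is left out because it is the default). Refitting on all the data at the end is the usual practice once the hyperparameters are fixed, not something taken from the question's code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the X and y of the question (assumption for illustration)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = LogisticRegression(C=0.009, solver='newton-cg')

# One ROC AUC score per fold; cross_val_score fits a fresh clone on each fold
scores = cross_val_score(model, X, y, cv=kf, scoring='roc_auc')
print('ROC AUC per fold:', scores)
print('mean = {:.3f}, std = {:.3f}'.format(scores.mean(), scores.std()))

# Final model: fit on all available data, not on the "best" fold
final_model = model.fit(X, y)
```

The mean and standard deviation describe how the model is expected to perform on unseen data. No single fold is "the best": differences between folds mostly reflect which rows happened to land in each test split.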