I have a question today that I can't figure out the answer to. I use stratified K-fold cross-validation in my grid search (to find good hyperparameters for my ML model). Can I take the best fold to train my final model? For example, here is my code with logistic regression:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

di = {}
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pred_test_full = 0
cv_score = []
i = 1
for train_index, test_index in kf.split(X, y):
    print('{} of KFold {}'.format(i, kf.n_splits))
    # kf.split returns positional indices, so index with .iloc
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # keep each fold's data so it can be retrieved later
    di[i] = {'X_train': X_train,
             'X_test': X_test,
             'y_train': y_train,
             'y_test': y_test}
    # model
    lr = LogisticRegression(C=0.009, penalty='l2', solver='newton-cg').fit(X_train, y_train)
    pred_proba = lr.predict_proba(X_test)[:, 1]
    score = roc_auc_score(y_test, pred_proba)
    precision, recall, thresholds = precision_recall_curve(y_test, pred_proba)
    # use the auc function to compute the area under the precision-recall curve
    auc_precision_recall = auc(recall, precision)
    print('ROC AUC score :', score)
    print('auc_precision_recall :', auc_precision_recall)
    cv_score.append(score)
    pred_test_full += pred_proba
    i += 1
# In this case fold 4 has the best results, so I take it to train my final model:
o = 4
X_train = di[o]['X_train']
X_test = di[o]['X_test']
y_train = di[o]['y_train']
y_test = di[o]['y_test']
# fitting of the final model
lr = LogisticRegression(C=0.009, penalty='l2', solver='newton-cg').fit(X_train, y_train)
I don't know whether doing this is "legal" or totally forbidden.
Thanks in advance for enlightening me!
No, you cannot do that at all :) It is similar to cherry-picking your data to get a better final score. Cross-validation is for estimating how well a model generalizes, not for selecting training data: take the scores of each fold and average them (and compute the standard deviation), or use cross_val_score, and then train the final model on all of the training data with the chosen hyperparameters. There are multiple answers and articles about this, like here or here.
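For illustration, here is a minimal sketch of that workflow, assuming X and y are the same training data as in your question; the hyperparameters are just the ones you used, not a recommendation:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

lr = LogisticRegression(C=0.009, penalty='l2', solver='newton-cg')
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Estimate generalization performance: report the mean and
# standard deviation of the score across *all* folds, not the best one.
scores = cross_val_score(lr, X, y, cv=cv, scoring='roc_auc')
print('ROC AUC: {:.3f} +/- {:.3f}'.format(scores.mean(), scores.std()))

# Once the hyperparameters are chosen, fit the final model
# on the full training set.
final_model = lr.fit(X, y)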