I'm a newbie currently working on a project for my Intro to Data Science course. In this project, I am required to tune hyperparameters using GridSearchCV to find the best K value for a KNN model. However, an issue has confused my peers and me: should we use the entire dataset (X, y) or only the training subset (X_train, y_train) when performing that process?
- Using the entire dataset: Some argue that calling gridsearchcv.fit(X, y) maximises the data available for training, potentially leading to a more reliable choice of the best K value for the KNN model.
- Using only the training set: Others propose using only the training data for gridsearchcv.fit(X_train, y_train), arguing that this approach prevents data leakage from the unseen test set. Since GridSearchCV performs cross-validation, a test set should be reserved to evaluate the final model.
Personally, I used only the training set for GridSearchCV, because I think the test data must be kept aside until the final evaluation step. Could you clarify the issue and explain which approach is more advisable for tuning KNN hyperparameters with GridSearchCV? Thanks!
This has already been answered here and here. Hyperparameter tuning is itself a form of learning from the data, so it must be done on the training set only. Using the entire dataset is the wrong approach: if the test data influences the choice of K, you will be unable to get an unbiased estimate of your model's performance on unseen data.
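A minimal sketch of the recommended workflow with scikit-learn, using its built-in iris dataset for illustration (the dataset and the candidate K values here are just example choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a test set first; GridSearchCV never sees it.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# The cross-validated search over K runs on the training set only.
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print("Best K:", search.best_params_["n_neighbors"])

# Final step: evaluate the refit best model once on the held-out test set.
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

After fitting, `search.best_estimator_` is already refit on all of X_train with the best K, so scoring it on X_test gives the unbiased performance estimate the test set was reserved for.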