I have a large dataset and I have split it in:
On each set, I performed missing values imputation and feature selection (trained on the training set, and replicated into validation and test set) to avoid data leakage.
Now, I want to train an XGBoost model in Python and want to perform hyperparameter-tuning using the training set, and evaluate each parameter set with the validation set. How can I do this using a random approach such as in RandomizedSearchCV, so that I don't run all the parameter sets?
If I am correct, GridSearch and RandomizedSearchCV allow only cross-validation, which is not what I want, because splitting the preprocessed training set in folds will result in data leakage. I know I could build a sklearn pipeline where I do the preprocessing in each fold, but I would like to avoid the latter option.
I can only think about this code that runs each parameter set like in GridSearch:
from sklearn.model_selection import ParameterGrid
import xgboost as xgb
# Define your hyperparameter grid
param_grid = {
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.1, 0.2],
'n_estimators': [100, 200, 300]
}
best_score = -1
best_params = {}
for params in ParameterGrid(param_grid):
model = xgb.XGBClassifier(**params)
model.fit(X_train, y_train)
val_score = model.score(X_val, y_val) # Or use a more specific metric
if val_score > best_score:
best_score = val_score
best_params = params
# Train the final model with the best hyperparameters
best_model = xgb.XGBClassifier(**best_params)
best_model.fit(X_train, y_train)
I understand your problem and tbh I don't have a clear cut answer. However, using the same validation set repeatedly like you propose is also not desired since you risk (over)fitting your hyperparams on that specific part of the data.
You could pre-specify the folds and impute the validation fold based on the other training folds. You could also accept that there is a small chance of leakage due to one or two observations ending up in the validation fold. Of course, this totally depends on your data and method of imputation. Wildly varying performance across the CV folds is an indicator that leakage is a problem.