Tags: machine-learning, scikit-learn, evaluation, hyperparameters

Hyperparameter tuning and model evaluation in scikit-learn


I'm a bit confused about how to combine hyperparameter tuning and model evaluation correctly.

Should hyperparameter tuning be done on the whole dataset or only on the training set? What is the correct sequence of actions?

Could you please review my code and advise on the best practice here?

Here I run hyperparameter tuning on the whole dataset first, and only afterwards split into train and test sets to fit and evaluate the final model. Is that correct? Doesn't it lead to data leakage?

Hyperparameter tuning:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = X.select_dtypes(include=['int', 'float']).columns
categorical_features = X.select_dtypes(include=['object', 'category']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

en_cv = ElasticNetCV(l1_ratio=np.arange(0, 1.1, 0.1),
                     alphas=np.arange(0, 1.1, 0.1),
                     random_state=818,
                     n_jobs=-1)

model = make_pipeline(preprocessor, en_cv)
model.fit(X, y)

best_alpha = en_cv.alpha_
best_l1_ratio = en_cv.l1_ratio_

Model evaluation:

en_model = make_pipeline(preprocessor, ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=818)

en_model.fit(X_train, y_train)
y_pred = en_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)

print(r2, mse)

Incidentally, this code took about 18 minutes to run on a dataset with about 80,000 observations and roughly 150 columns. Is that considered reasonable?


Solution

  • Concerning the hyperparameter tuning question: it should always be done on the training set, not the whole dataset. Tuning hyperparameters on the entire dataset introduces what we call "data leakage": information from the test set (which should remain unseen) influences model selection, so the performance estimates you get are optimistically biased. The short sketch after the numbered steps below shows the two orderings side by side.

    Concerning the second question, a baseline workflow looks like this:

    1. Split your dataset into training and testing sets.
    2. Perform hyperparameter tuning only on the training set.
    3. After determining the best hyperparameters, retrain your model using these hyperparameters on the training set.
    4. Evaluate the model performance on the test set to get an unbiased estimate of its generalization ability.
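
    To make the difference concrete, here is a minimal sketch of the two orderings side by side; the synthetic data from make_regression is only there to keep the snippet self-contained:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNetCV
    from sklearn.model_selection import train_test_split

    # Illustrative synthetic regression problem
    X_demo, y_demo = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

    # Leaky ordering: the CV search has already seen the rows that end up in X_te,
    # so its test score tends to be optimistically biased
    leaky = ElasticNetCV(cv=5).fit(X_demo, y_demo)

    # Clean ordering: the search never touches X_te
    clean = ElasticNetCV(cv=5).fit(X_tr, y_tr)

    print(leaky.score(X_te, y_te), clean.score(X_te, y_te))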

    Based on the above, your posted code becomes:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import ElasticNet, ElasticNetCV
    from sklearn.metrics import r2_score, mean_squared_error
    
    # Split the data into training and testing sets first
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=818)
    
    numeric_features = X_train.select_dtypes(include=['int', 'float']).columns
    categorical_features = X_train.select_dtypes(include=['object', 'category']).columns
    
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', StandardScaler(), numeric_features),
            ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
        ]
    )
    
    # Perform hyperparameter tuning on the training set
    # (grid kept from the original post; note that scikit-learn advises against
    # alpha=0, i.e. plain OLS, with the coordinate-descent solver)
    en_cv = ElasticNetCV(l1_ratio=np.arange(0, 1.1, 0.1),
                         alphas=np.arange(0, 1.1, 0.1),
                         random_state=818,
                         n_jobs=-1)
    model = make_pipeline(preprocessor, en_cv)
    model.fit(X_train, y_train)  # Use only training data here
    
    # Get the best hyperparameters
    best_alpha = en_cv.alpha_
    best_l1_ratio = en_cv.l1_ratio_
    
    # Train a new model on the training data using the best hyperparameters
    elastic_net_model = make_pipeline(preprocessor, ElasticNet(alpha=best_alpha, l1_ratio=best_l1_ratio))
    elastic_net_model.fit(X_train, y_train)
    
    # Evaluate the model on the test set
    y_pred = elastic_net_model.predict(X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    
    print(f"R^2: {r2}, MSE: {mse}")