I have more of a best practice question.
I am scaling my data, and I understand that I should call fit_transform on the training set and only transform on the test set to avoid data leakage.
Now, if I use 5-fold cross-validation on my training set while also keeping a separate holdout test set, is it necessary to scale each fold independently?
My problem is that I want to use Feature Selection like this:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
scaler = MinMaxScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
efs = EFS(clf_tmp,
          min_features=min_features,
          max_features=max_features,
          cv=5,
          n_jobs=n_jobs)
efs = efs.fit(X_train, y_train)
Right now I am scaling X_train and X_test separately. But when the whole scaled training set goes into the feature selector, its internal cross-validation will suffer some data leakage between folds, because the scaler was fit on all of X_train. Is this a problem for evaluation?
Yes, it's definitely best practice to keep all preprocessing inside the cross-validation loop to avoid data leakage. Within each fold, the scaler should be fit on that fold's training portion only and then applied to the fold's validation portion.
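The easiest way to get this behavior is to wrap the scaler and the classifier in a scikit-learn Pipeline and pass the pipeline around as if it were a single estimator. A minimal sketch (the data, the LogisticRegression classifier, and all parameter values here are illustrative assumptions, not from your post):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for your X and y (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# The scaler lives inside the pipeline, so every CV split fits it
# on that split's training portion only -- no leakage into the
# validation portion.
pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("clf", LogisticRegression())])

# Each of the 5 folds refits the scaler from scratch.
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())
```

The same idea applies to your feature-selection step: pass `pipe` (instead of the bare `clf_tmp`) as the estimator to mlxtend's ExhaustiveFeatureSelector, and the scaler will be refit inside each of EFS's internal CV folds. The final fitted pipeline can then be applied to the untouched holdout test set.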