I have more of a best practice question.
I am scaling my data, and I understand that I should call fit_transform on the training set and only transform on the test set to avoid data leakage.
Now, if I use 5-fold cross-validation on my training set while also keeping a separate holdout test set, is it necessary to scale each fold independently?
My problem is that I want to use Feature Selection like this:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
scaler = MinMaxScaler()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
efs = EFS(clf_tmp,
          min_features=min_features,
          max_features=max_features,
          cv=5,
          n_jobs=n_jobs)
efs = efs.fit(X_train, y_train)
Right now I am scaling X_train and X_test separately. But when the whole scaled training set goes into the feature selector, its internal cross-validation will suffer some data leakage between folds, because the scaler was fit on all of X_train. Is this a problem for evaluation?
Yes, it's definitely best practice to keep all preprocessing inside the cross-validation loop to avoid data leakage. Within each fold, the scaler should be fit on that fold's training portion only and then applied to the fold's validation portion.
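The easiest way to get this behavior is to wrap the scaler and the classifier in a scikit-learn Pipeline and pass the pipeline around as if it were a single estimator. A minimal sketch (the data, the LogisticRegression classifier, and all parameter values here are illustrative assumptions, not from your post):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for your X and y (assumption)
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# The scaler lives inside the pipeline, so every CV split fits it
# on that split's training portion only -- no leakage into the
# validation portion.
pipe = Pipeline([("scaler", MinMaxScaler()),
                 ("clf", LogisticRegression())])

# Each of the 5 folds refits the scaler from scratch.
scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(scores.mean())
```

The same idea applies to your feature-selection step: pass `pipe` (instead of the bare `clf_tmp`) as the estimator to mlxtend's ExhaustiveFeatureSelector, and the scaler will be refit inside each of EFS's internal CV folds. The final fitted pipeline can then be applied to the untouched holdout test set.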