I have an imbalanced dataset: only 2% of the values in y are 1. I want to balance only the training set and then perform feature selection on the balanced training set before fitting the model.
After splitting and balancing, I need to combine X_train and y_train into one DataFrame.
What is the correct way to do this while making sure each y value is merged with its corresponding row of X?
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN

X_temp, X_test, y_temp, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, random_state=5, stratify=y)
X_train, X_dev, y_train, y_dev = train_test_split(X_temp, y_temp, shuffle=True, test_size=0.10, random_state=8, stratify=y_temp)
smt = SMOTEENN(random_state=122)
X_train, y_train = smt.fit_resample(X_train, y_train)
# Check the class balance after resampling
y_train["lung_cancer"].value_counts()
1 99697
0 88464
P.S. I removed the ID column ('plco_id') when defining X. Is there any way to keep it in both X and y during the split and the balancing? How?
X = df2.loc[:, ~df2.columns.isin(['lung_cancer', 'plco_id'])]
y = df2.iloc[:, [1]]
How can I make sure that each y will indeed be merged to the correct row of X?
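The order is not changed, so you can just concatenate them like this:
train_df = pd.concat([X_train, y_train], axis=1)
Just imagine: if the order did not stay the same, how would a classifier know which row in X belongs to which element in y? Keep in mind that pd.concat with axis=1 matches rows by index label, so this relies on X_train and y_train sharing the same index.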
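To illustrate the concatenation step, here is a minimal sketch on toy data. The values and the feature columns `age` and `pack_years` are made up for illustration; only `lung_cancer` comes from the question. It checks that the two indexes agree before concatenating, and shows what happens (and one possible fix) when they do not.

```python
import pandas as pd

# Hypothetical toy data standing in for the resampled X_train / y_train.
X_train = pd.DataFrame({"age": [61, 55, 70], "pack_years": [30, 0, 45]})
y_train = pd.DataFrame({"lung_cancer": [1, 0, 1]})

# pd.concat(axis=1) matches rows by index label, so confirm the labels agree first.
assert X_train.index.equals(y_train.index)
train_df = pd.concat([X_train, y_train], axis=1)

# Same rows in the same order, but with different index labels.
y_shifted = y_train.copy()
y_shifted.index = [10, 11, 12]

# Labels do not match, so the result has 6 rows padded with NaN.
misaligned = pd.concat([X_train, y_shifted], axis=1)

# If the row order is known to match, resetting both indexes restores the pairing.
aligned = pd.concat(
    [X_train.reset_index(drop=True), y_shifted.reset_index(drop=True)], axis=1
)
```

If the `assert` fails on your real data, the row order may still be correct; it only means the index labels differ, in which case resetting both indexes as in the last step is one option.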
is there any way I could keep it during the split and the balancing in the X and the y? How?
You can set it as an index like this:
df2.set_index("plco_id", inplace=True)
X = df2.loc[:, ~df2.columns.isin(['lung_cancer'])]
y = df2.iloc[:, [0]]
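As a rough sketch of how the index carries the ID through the split, the example below uses a made-up ten-row frame (the `age` column and all values are hypothetical; only the names `plco_id` and `lung_cancer` come from the question). It covers the split only. Resampling with SMOTEENN creates synthetic rows that have no original ID, so it is worth inspecting `X_train.index` after `fit_resample` rather than assuming the IDs are still there.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame; only 'plco_id' and 'lung_cancer' are names from the question.
df2 = pd.DataFrame({
    "plco_id": [f"A{i:02d}" for i in range(10)],
    "lung_cancer": [0, 1] * 5,
    "age": list(range(50, 60)),
})
df2 = df2.set_index("plco_id")

# Select by name rather than position so the code does not depend on column order.
X = df2.drop(columns=["lung_cancer"])
y = df2[["lung_cancer"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y
)

# The ID travels with each row as the index, so X and y can be matched by label.
assert X_train.index.equals(y_train.index)
train_df = X_train.join(y_train)
```

Selecting `y` by column name is a design choice made here for robustness; the positional `iloc` version above behaves the same as long as `lung_cancer` is the first remaining column.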