Tags: python, pandas, train-test-split, imbalanced-data, imblearn

How to combine X_train and y_train into one balanced dataframe?


I have an imbalanced dataset: only 2% of the values in y are 1. I want to balance only the training set, and then run feature selection on the balanced training set before fitting the model.

After splitting and balancing, I need to combine X_train and y_train into one DataFrame.

What is the correct way to do this while making sure each y value is merged with its corresponding row of X?

  1. I performed the train/dev/test split and kept the 2% stratification of y in each dataset like this:

    from sklearn.model_selection import train_test_split
    from imblearn.combine import SMOTEENN

    X_temp, X_test, y_temp, y_test = train_test_split(X, y, shuffle=True, test_size=0.2, random_state=5, stratify=y)

    X_train, X_dev, y_train, y_dev = train_test_split(X_temp, y_temp, shuffle=True, test_size=0.10, random_state=8, stratify=y_temp)

  2. Then I balanced only the train dataset like this:

    smt = SMOTEENN(random_state=122)
    X_train, y_train = smt.fit_resample(X_train, y_train)

    # Check the balancing
    y_train["lung_cancer"].value_counts()

    1    99697
    0    88464

  3. Now I would like to combine X_train and y_train into one dataframe in order to perform feature selection. How can I make sure that each y will indeed be merged to the correct row of X?

P.S. I removed the ID ('plco_id') while defining X. Is there any way I could keep it during the split and the balancing in the X and the y? How?

X = df2.loc[:, ~df2.columns.isin(['lung_cancer', 'plco_id'])]
y = df2.iloc[:, [1]]

Solution

  • How can I make sure that each y will indeed be merged to the correct row of X?

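    The row order is the same in X_train and y_train, so you can concatenate them column-wise like this: train_df = pd.concat([X_train, y_train], axis=1). Keep in mind that pd.concat matches rows by index label rather than by position, so this works as long as both objects carry the same index. Just imagine: if the order didn't stay the same, how would a classifier know which row of X belongs to which element of y?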
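
To make the index-alignment point concrete, here is a minimal sketch on toy data. The feature names `age` and `pack_years` and all values are made up for illustration; only `lung_cancer`, `X_train` and `y_train` come from the question. It checks that the indexes agree before concatenating, and shows a positional fallback for the case where they do not.

```python
import pandas as pd

# Toy stand-ins for the resampled X_train / y_train (values are made up).
X_train = pd.DataFrame({"age": [61, 55, 70], "pack_years": [30.0, 0.0, 45.5]})
y_train = pd.DataFrame({"lung_cancer": [1, 0, 1]})

# pd.concat(axis=1) matches rows by index label, so verify the labels agree.
assert X_train.index.equals(y_train.index)
train_df = pd.concat([X_train, y_train], axis=1)

# What goes wrong when the labels differ: rows are not paired up, and the
# result is padded with NaN instead.
y_shifted = y_train.copy()
y_shifted.index = [10, 11, 12]
misaligned = pd.concat([X_train, y_shifted], axis=1)

# If the row order is known to match, dropping the labels on both sides
# makes the concatenation purely positional.
positional = pd.concat(
    [X_train.reset_index(drop=True), y_shifted.reset_index(drop=True)],
    axis=1,
)
```

The `assert` is cheap, so it may be worth leaving it in the real pipeline right before the concatenation.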

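    is there any way I could keep it during the split and the balancing in the X and the y? How?

    You can set it as the index before defining X and y, so that it travels with the rows through train_test_split without being used as a feature. Note that SMOTEENN generates synthetic rows that correspond to no original ID, so the index is only meaningful up to the balancing step. Setting the index looks like this: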

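    # Move the ID into the index so it is kept but not treated as a feature
    df2 = df2.set_index("plco_id")
    # Select by column name rather than by position
    X = df2.drop(columns=['lung_cancer'])
    y = df2[['lung_cancer']]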
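
As a rough check of the index approach, the sketch below runs it on a small toy frame. The column names `plco_id` and `lung_cancer` are from the question, while `age` and all values are invented. It covers only the split, not the balancing, and it assumes scikit-learn is installed.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the question's column names; the values are made up.
df2 = pd.DataFrame({
    "plco_id": [f"A{i:03d}" for i in range(20)],
    "lung_cancer": [1, 0, 0, 0] * 5,
    "age": range(50, 70),
})

df2 = df2.set_index("plco_id")
X = df2.drop(columns=["lung_cancer"])
y = df2[["lung_cancer"]]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5, stratify=y
)

# The IDs travel with the rows, and X and y stay aligned on them.
assert X_train.index.equals(y_train.index)

# Turn the ID back into a regular column when building the combined frame.
train_df = pd.concat([X_train, y_train], axis=1).reset_index()
```

Because the ID sits in the index, it never reaches the model as a feature, yet every row of `train_df` can still be traced back to `df2`.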