Tags: machine-learning, imbalanced-data, imblearn, smote

Identify the Synthetic Samples generated by SMOTE


I have a labeled dataset where X has shape 7000 x 2400 and y has shape 7000. The data is heavily imbalanced, so I am trying to generate synthetic samples using SMOTE. However, I want to identify which samples SMOTE actually generated. As an example, here's a code snippet:

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from imblearn.over_sampling import SMOTE

iris = load_iris()

X = iris['data']
y = iris['target']

# The iris data is balanced, so I intentionally drop some samples to make it imbalanced
X = X[:125, :]
y = y[:125]

oversample = SMOTE()
X_smt, y_smt = oversample.fit_resample(X, y)

The arrays X_smt and y_smt contain both the original samples and the synthetic samples. Is there a simple way to identify the synthetic samples, by index or some other mechanism?


Solution

  • I really feel stupid... the answer is that simple. It seems SMOTE just appends the new samples after the original samples. Adding these two lines confirms it:

    for i in range(X_smt.shape[0]):
      print(any(np.array_equal(X_smt[i],j) for j in X),i)
    

    This checks whether each row of X_smt also occurs in X. Since X has 125 rows (indices 0 to 124), each of the first 125 rows of X_smt should be found in X, whereas rows from index 125 onwards should not be. The printed output confirms this. Feel free to run the notebook here
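If the append-at-the-end ordering observed above holds, the synthetic rows can be picked out with a boolean mask built from the original row count. The following is a minimal sketch under that assumption; the ordering is observed behaviour, not something the answer shows to be a documented guarantee. The helper name `split_original_synthetic` is hypothetical, and the small arrays are stand-ins for real `fit_resample` output.

```python
import numpy as np


def split_original_synthetic(X_res, y_res, n_original):
    """Split resampled arrays into a synthetic mask and the synthetic rows.

    Assumes the first n_original rows of X_res are the untouched original
    samples and every later row was appended by the over-sampler.
    """
    X_res = np.asarray(X_res)
    y_res = np.asarray(y_res)
    if not 0 <= n_original <= X_res.shape[0]:
        raise ValueError("n_original must be between 0 and len(X_res)")
    # Boolean mask: False for original rows, True for appended rows.
    is_synthetic = np.arange(X_res.shape[0]) >= n_original
    return is_synthetic, X_res[is_synthetic], y_res[is_synthetic]


# Stand-in for resampled output: 3 original rows followed by 2 appended rows.
X_orig = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_orig = np.array([0, 0, 1])
X_res = np.vstack([X_orig, [[5.5, 6.5], [4.5, 5.5]]])
y_res = np.concatenate([y_orig, [1, 1]])

mask, X_syn, y_syn = split_original_synthetic(X_res, y_res, n_original=len(X_orig))

# Sanity check: the leading block must be identical to the original data.
assert np.array_equal(X_res[~mask], X_orig)
print(mask)
```

With the arrays from the question, the call would be `split_original_synthetic(X_smt, y_smt, n_original=X.shape[0])`. Keeping the `np.array_equal` check on the leading block is a cheap way to catch a case where the ordering assumption does not hold.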