I have a labeled dataset where X has shape 7000 x 2400 and y has shape 7000. The data is heavily imbalanced, so I am trying to generate synthetic samples using SMOTE. However, I want to identify which samples SMOTE actually generated. As an example, here's a code snippet:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from imblearn.over_sampling import SMOTE
iris = load_iris()
X = iris['data']
y = iris['target']
# The data is balanced, so I intentionally drop some samples to make it imbalanced
X = X[:125, :]
y = y[:125]
oversample = SMOTE()
X_smt, y_smt = oversample.fit_resample(X, y)
The arrays X_smt and y_smt contain both the original samples and the synthetic samples. Is there a simple way to identify the synthetic samples, by index or some other mechanism?
I really feel stupid... the answer is that simple. It seems SMOTE just appends the new samples after the original samples. Adding these two lines confirms it:
for i in range(X_smt.shape[0]):
    print(any(np.array_equal(X_smt[i], j) for j in X), i)
What we are doing is looking up each row of X_smt in X. Since X has 125 rows (0 to 124), each of the first 125 rows of X_smt should be found in X, whereas rows indexed from 125 onwards should not be. The printed output confirms this. Feel free to run the notebook here
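If it helps, here is a minimal sketch of how that observation can be turned into a boolean mask. It assumes the append-after-originals behaviour seen above, which I have only checked empirically. The helper names synthetic_mask and originals_come_first are my own, and the small arrays are a hypothetical stand-in for SMOTE output, not real results.

```python
import numpy as np


def synthetic_mask(n_resampled, n_original):
    """Return a boolean mask that is True for rows after the first n_original.

    Assumption: the resampler keeps the original rows first, in order,
    and appends the synthetic rows after them.
    """
    return np.arange(n_resampled) >= n_original


def originals_come_first(X_resampled, X_original):
    """Check the assumption: the leading block must equal the original data."""
    n = X_original.shape[0]
    return X_resampled.shape[0] >= n and np.array_equal(X_resampled[:n], X_original)


# Hypothetical stand-in for resampled data: 4 original rows followed by
# 2 made-up interpolated rows (these are NOT actual SMOTE results).
X = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 0.0], [6.0, 2.0]])
synthetic = np.array([[4.5, 0.5], [5.0, 1.0]])
X_smt = np.vstack([X, synthetic])

is_synthetic = synthetic_mask(X_smt.shape[0], X.shape[0])

print(originals_come_first(X_smt, X))  # sanity check of the assumption
print(X_smt[is_synthetic])             # rows treated as synthetic
```

With the snippet from the question, this would be called as synthetic_mask(X_smt.shape[0], X.shape[0]), and the synthetic samples would then be X_smt[is_synthetic] and y_smt[is_synthetic]. Comparing only the leading block is much cheaper than the row-by-row loop, which should matter for something like 7000 x 2400.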