python, machine-learning, oversampling, imblearn, imbalanced-data

How to use a combination of over- and undersampling with imbalanced-learn?


I want to resample some big data (class sizes: 8 million vs. 2700). I would like to have 50,000 samples of each class by oversampling class 2 and undersampling class 1. imblearn seems to offer a combination of over- and undersampling, but I don't understand how it works.

from collections import Counter
from imblearn.combine import SMOTETomek

# Combined resampling: SMOTE oversamples the minority class,
# then Tomek links are removed to clean up overlapping samples.
smt = SMOTETomek(random_state=1)
X_resamp, y_resamp = smt.fit_resample(data_all[29000:30000], labels_all[29000:30000])

Before resampling the data looked like

>>> Counter(labels_all[29000:30000])
Counter({0: 968, 9: 32})

and afterwards

>>> Counter(y_resamp)
Counter({0: 968, 9: 968})

which is what I would expect, but what I actually want is something like

>>> Counter(y_resamp)
Counter({0: 100, 9: 100})

Solution

  • It seems you only have 32 records of class 9, so SMOTETomek oversamples that class until its count matches that of class 0, hence 9: 968.

    Since you are talking about reducing the data set to 100 records per class, you can do that by sampling 100 records randomly for each class from X and y (the same 100 records in both), or by taking the first 100, e.g. y_resamp[:100].