I want to resample some big data (class sizes: 8 million vs. 2,700). I would like to end up with 50,000 samples of each class by oversampling class 2 and undersampling class 1. imblearn seems to offer a combination of over- and undersampling, but I don't get how it works.
from collections import Counter
from imblearn.over_sampling import SMOTENC
from imblearn.under_sampling import TomekLinks
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=1)
# try it on a small slice of the full data first
X_resamp, y_resamp = smt.fit_resample(data_all[29000:30000], labels_all[29000:30000])
Before resampling, the labels looked like
>>Counter(labels_all[29000:30000])
>>Counter({0: 968, 9: 32})
and afterwards
>>Counter(y_resamp)
>>Counter({0: 968, 9: 968})
whereas I would have expected, or wished for, something like
>>Counter(y_resamp)
>>Counter({0: 100, 9: 100})
It seems you only have 32 records of class 9 in that slice, so by default SMOTETomek oversamples that class until its count matches class 0, hence 9: 968.
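If what you actually want is a fixed count per class (your 50,000 of each) rather than "match the majority", the samplers also accept a sampling_strategy dict. A minimal sketch, assuming the classes in the full data are labelled 0 and 9 as in your slice, and using plain SMOTE plus RandomUnderSampler instead of SMOTETomek:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# oversample the minority class (9) up to 50,000 samples
over = SMOTE(sampling_strategy={9: 50_000}, random_state=1)
# then undersample the majority class (0) down to 50,000 samples
under = RandomUnderSampler(sampling_strategy={0: 50_000}, random_state=1)

X_over, y_over = over.fit_resample(data_all, labels_all)
X_final, y_final = under.fit_resample(X_over, y_over)
# Counter(y_final) should now be Counter({0: 50000, 9: 50000})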
As for reducing the data set to 100 records per class: you can do that by randomly sampling 100 records of each class from X and y (the same 100 rows in both), or simply take the first 100, e.g. y_resamp[:100].
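A quick way to do that per-class random draw, assuming X_resamp and y_resamp are NumPy arrays as returned by fit_resample (use .iloc instead if you work with DataFrames); the 100 and the seed are just placeholders:

import numpy as np

rng = np.random.default_rng(1)
keep = []
for cls in np.unique(y_resamp):
    idx = np.flatnonzero(y_resamp == cls)                  # row positions of this class
    keep.append(rng.choice(idx, size=100, replace=False))  # draw 100 rows without replacement
keep = np.concatenate(keep)
X_small, y_small = X_resamp[keep], y_resamp[keep]          # keep the same rows in X and y
# Counter(y_small) -> Counter({0: 100, 9: 100})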