I have a DataFrame like the one below, and I want to oversample on the column "role" (in the real case the number of rows and columns is much bigger than in this minimal example):
role value
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 2
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 1 2
pop_13vdpn1_site_1 1 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_1 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 2
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_2 2 1
pop_13vdpn1_site_3 2 1
[...........]
Index: 20 entries, pop_13vdpn1_site_1 to pop_13vdpn1_site_1
Data columns (total 2 columns):
role 20 non-null int64
value 20 non-null int64
This is what I'm doing:
X,y = smote.fit_sample(df,df[['role']])
X
role value
0 1 1
1 1 1
2 1 2
3 1 1
4 1 1
5 1 2
6 1 1
7 2 1
8 2 1
[.........]
This works, but the problem is that I need to keep the index (pop_13vdpn1_site_1, etc.). Is that possible?
In the end I found a workaround (maybe not optimal):
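from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# Move the index labels into a regular column
df_tmp = df.reset_index()
# Encode the string labels as integers so the sampler can handle them
df_tmp['index'] = le.fit_transform(df_tmp['index'])
aa,bb = smote.fit_sample(df_tmp,df_tmp[['role']])
# Map the integer codes back to the original labels and restore the index
aa['index'] = le.inverse_transform(aa['index'])
aa = aa.set_index('index')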
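To make the encode / resample / decode round trip in this workaround easier to follow, here is a minimal standard-library sketch of the same idea. It is only an illustration under stated assumptions: the toy rows are made up to resemble the frame above, and `oversample_by_duplication` is a hypothetical stand-in that duplicates existing minority rows. It is not SMOTE, which synthesizes new rows by interpolation, so with a real SMOTE run the encoded index column of synthetic rows may not correspond cleanly to one original label.

```python
import random
from collections import Counter

# Toy rows (index label, role, value), made up to resemble the frame above.
rows = [
    ("pop_13vdpn1_site_1", 1, 1),
    ("pop_13vdpn1_site_1", 1, 2),
    ("pop_13vdpn1_site_1", 2, 1),
    ("pop_13vdpn1_site_2", 2, 1),
    ("pop_13vdpn1_site_2", 2, 2),
    ("pop_13vdpn1_site_3", 2, 1),
]

# Encode: sorted unique labels -> integer codes, the same mapping idea
# that LabelEncoder.fit_transform applies to the 'index' column.
classes = sorted({label for label, _, _ in rows})
code_of = {label: i for i, label in enumerate(classes)}
encoded = [(code_of[label], role, value) for label, role, value in rows]


def oversample_by_duplication(data, target_col, seed=0):
    """Hypothetical stand-in for the resampler (NOT SMOTE): duplicate
    randomly chosen rows of each smaller class until every class is as
    large as the biggest one. Original rows are kept first, in order."""
    rng = random.Random(seed)
    counts = Counter(row[target_col] for row in data)
    biggest = max(counts.values())
    out = list(data)
    for cls, n in counts.items():
        pool = [row for row in data if row[target_col] == cls]
        out.extend(rng.choice(pool) for _ in range(biggest - n))
    return out


resampled = oversample_by_duplication(encoded, target_col=1)

# Decode: integer codes -> original labels (the inverse_transform step).
decoded = [(classes[code], role, value) for code, role, value in resampled]

for label, role, value in decoded:
    print(label, role, value)
```

Because this stand-in only copies existing rows, every code stays a valid integer and decodes back to a real label; the original rows come first and the duplicated minority rows are appended after them.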