python pandas imbalanced-data oversampling smote

How to keep/extend the index when oversampling


I've got a dataframe like the one below, and I want to oversample on the column "role" (in the real case the number of rows/columns is much bigger than in this minimal example):

                 role  value
pop_13vdpn1_site_1  1   1
pop_13vdpn1_site_1  1   1
pop_13vdpn1_site_1  1   2
pop_13vdpn1_site_1  1   1
pop_13vdpn1_site_1  1   1
pop_13vdpn1_site_1  1   2
pop_13vdpn1_site_1  1   1
pop_13vdpn1_site_1  2   1
pop_13vdpn1_site_1  2   1
pop_13vdpn1_site_1  2   1
pop_13vdpn1_site_2  2   1
pop_13vdpn1_site_2  2   2
pop_13vdpn1_site_2  2   1
pop_13vdpn1_site_2  2   1
pop_13vdpn1_site_2  2   1
pop_13vdpn1_site_2  2   1
pop_13vdpn1_site_2  2   1
pop_13vdpn1_site_2  2   1
pop_13vdpn1_site_2  2   1
pop_13vdpn1_site_3  2   1
[...........]

Index: 20 entries, pop_13vdpn1_site_1 to pop_13vdpn1_site_1
Data columns (total 2 columns):
role     20 non-null int64
value    20 non-null int64

This is what I'm doing:

X,y = smote.fit_sample(df,df[['role']])
X
       role value
0   1   1
1   1   1
2   1   2
3   1   1
4   1   1
5   1   2
6   1   1
7   2   1
8   2   1
[.........]

This works, but the problem is that I need to keep the index (pop_13vdpn1_site_1, etc.). Is that possible?


Solution

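  • I finally found a workaround (maybe not optimal): label-encode the index, carry it through SMOTE as an ordinary column, then decode it and restore it as the index.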

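    from sklearn.preprocessing import LabelEncoder

    # `smote` is the sampler instance used in the question.
    le = LabelEncoder()
    df_tmp = df.reset_index()  # move the string index into a column named 'index'
    df_tmp['index'] = le.fit_transform(df_tmp['index'])  # encode labels as integers
    # Newer imbalanced-learn releases name this method fit_resample.
    aa, bb = smote.fit_sample(df_tmp, df_tmp[['role']])
    aa['index'] = le.inverse_transform(aa['index'])  # decode back to the original labels
    aa = aa.set_index('index')  # set_index returns a new frame, so assign it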
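
    If plain random oversampling is acceptable instead of SMOTE, a pandas-only sketch can keep the index without any encoding step. This is a different technique: it duplicates existing minority rows rather than synthesising new ones, so each copied row simply carries its own index label. The toy data and the helper name `oversample_keep_index` are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Toy frame shaped like the one in the question (values are made up).
df = pd.DataFrame(
    {"role": [1, 1, 1, 2, 2, 2, 2, 2, 2, 2],
     "value": [1, 2, 1, 1, 1, 2, 1, 1, 1, 1]},
    index=["pop_13vdpn1_site_1"] * 5 + ["pop_13vdpn1_site_2"] * 5,
)

def oversample_keep_index(frame, target, random_state=0):
    """Randomly duplicate rows so every class in `target` reaches the majority count."""
    majority = frame[target].value_counts().max()
    parts = []
    for _, group in frame.groupby(target):
        parts.append(group)
        extra = majority - len(group)
        if extra > 0:
            # Sampling with replacement copies whole rows, index label included.
            parts.append(group.sample(n=extra, replace=True, random_state=random_state))
    return pd.concat(parts)

balanced = oversample_keep_index(df, "role")
print(balanced["role"].value_counts())
print(balanced.index.unique())
```

    Because no new rows are invented, there is no question of which label a synthetic sample should get. The trade-off is that the minority class gains only exact duplicates, which may or may not suit the model being trained.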