I have a dataset with multi-labeled data. There are 20 labels in total (0 to 19) with an imbalanced distribution among them. Here is an overview of the data:
|id |label|value |
|-----|-----|------------|
|95534|0 |65.250002088|
|95535|18 | |
|95536|0 | |
|95536|0 |100 |
|95536|0 | |
|95536|0 |53.68547236 |
|95536|0 | |
|95537|1 | |
|95538|0 | |
|95538|0 | |
|95538|0 | |
|95538|0 |656.06155202|
|95538|0 | |
|95539|2 | |
|5935 |0 | |
|5935 |0 |150 |
|5935 |0 |50 |
|5935 |0 |24.610985335|
|5935 |0 | |
|5935 |0 |223.81789584|
|5935 |0 |148.1805218 |
|5935 |0 |110.9712538 |
|34147|19 |73.62651909 |
|34147|19 | |
|34147|19 |53.35958016 |
|34147|19 | |
|34147|19 | |
|34147|19 | |
|34147|19 |393.54029411|
I am looking to oversample the data to balance the labels. I came across methods like SMOTE and SMOTENC, but they all require splitting the data into train and test sets first, and they do not work with sparse data. Is there a way to do this on the entire dataset in the pre-processing step, before splitting?
To sample rows so that each label is drawn with equal probability 1/n_labels, each row within a label should be drawn with probability 1/n_rows, where n_rows is the number of rows carrying that label. The probability for each row is then p_row = 1/(n_labels * n_rows). You can generate these weights with groupby and pass them to df.sample as follows:
import numpy as np
import pandas as pd
df_dict = {'id': {0: 95535, 1: 95536, 2: 95536, 3: 95536, 4: 95536, 5: 95536, 6: 95537, 7: 95538, 8: 95538, 9: 95538, 10: 95538, 11: 95538, 12: 95539, 13: 5935, 14: 5935, 15: 5935, 16: 5935, 17: 5935, 18: 5935, 19: 5935, 20: 5935, 21: 34147, 22: 34147, 23: 34147, 24: 34147, 25: 34147, 26: 34147, 27: 34147}, 'label': {0: 18, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 2, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 19, 22: 19, 23: 19, 24: 19, 25: 19, 26: 19, 27: 19}, 'value': {0: ' ', 1: ' ', 2: '100 ', 3: ' ', 4: '53.68547236 ', 5: ' ', 6: ' ', 7: ' ', 8: ' ', 9: ' ', 10: '656.06155202', 11: ' ', 12: ' ', 13: ' ', 14: '150 ', 15: '50 ', 16: '24.610985335', 17: ' ', 18: '223.81789584', 19: '148.1805218 ', 20: '110.9712538 ', 21: '73.62651909 ', 22: ' ', 23: '53.35958016 ', 24: ' ', 25: ' ', 26: ' ', 27: '393.54029411'}}
df = pd.DataFrame.from_dict(df_dict)
# number of distinct labels and, for each row, the count of rows sharing its label
n_labels = df.label.nunique()
n_rows = df.groupby("label").id.transform("count")
# per-row sampling weight: 1/(n_labels * n_rows)
weights = 1/(n_rows*n_labels)
# sanity check: the weights should sum to (approximately) 1
np.isclose(weights.sum(), 1.0)
df_samples = df.sample(n=40000, weights=weights, replace=True, random_state=19)
# verify that label draws are approximately uniform:
print(df_samples.label.value_counts()/len(df_samples))
# sampling frequency by group:
# 0 0.203325
# 2 0.201075
# 18 0.200925
# 19 0.198850
# 1 0.195825
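If you want each label to appear an exact, equal number of times rather than approximately equal frequencies, here is a minimal alternative sketch (assuming pandas >= 1.1 for GroupBy.sample; the per-label count of 2000 is an arbitrary choice for illustration):

# oversample each label to the same fixed size (2000 rows per label, chosen
# arbitrarily) by sampling with replacement within each group
df_balanced = (
    df.groupby("label")
      .sample(n=2000, replace=True, random_state=19)
      .reset_index(drop=True)
)
print(df_balanced.label.value_counts())  # every label appears exactly 2000 times

Either way, the oversampling runs on the full DataFrame, so you can do it before any train/test split.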