Tags: python, pandas, dataframe, oversampling, smote

Oversampling a sparse dataset in Python


I have a dataset with multi-labeled data. There are 20 labels in total (0 to 19) with an imbalanced distribution among them. Here is an overview of the data:

|id   |label|value       |
|-----|-----|------------|
|95534|0    |65.250002088|
|95535|18   |            |
|95536|0    |            |
|95536|0    |100         |
|95536|0    |            |
|95536|0    |53.68547236 |
|95536|0    |            |
|95537|1    |            |
|95538|0    |            |
|95538|0    |            |
|95538|0    |            |
|95538|0    |656.06155202|
|95538|0    |            |
|95539|2    |            |
|5935 |0    |            |
|5935 |0    |150         |
|5935 |0    |50          |
|5935 |0    |24.610985335|
|5935 |0    |            |
|5935 |0    |223.81789584|
|5935 |0    |148.1805218 |
|5935 |0    |110.9712538 |
|34147|19   |73.62651909 |
|34147|19   |            |
|34147|19   |53.35958016 |
|34147|19   |            |
|34147|19   |            |
|34147|19   |            |
|34147|19   |393.54029411|

I am looking to oversample the data to balance the labels. I came across methods like SMOTE and SMOTENC, but they all require splitting the data into train and test sets, and they do not work with sparse data. Is there any way I can do this on the entire dataset in the pre-processing step, before splitting?


Solution

  • To sample rows so that each label is sampled with equal probability:

    Drawing a label uniformly at random gives each label probability 1/n_labels; drawing a row uniformly within that label splits this evenly among its n_rows rows, so the probability for each row is p_row = 1/(n_labels*n_rows). You can generate these values with groupby and pass them to df.sample as follows:

    import numpy as np
    import pandas as pd
    
    df_dict = {'id': {0: 95535, 1: 95536, 2: 95536, 3: 95536, 4: 95536, 5: 95536, 6: 95537, 7: 95538, 8: 95538, 9: 95538, 10: 95538, 11: 95538, 12: 95539, 13: 5935, 14: 5935, 15: 5935, 16: 5935, 17: 5935, 18: 5935, 19: 5935, 20: 5935, 21: 34147, 22: 34147, 23: 34147, 24: 34147, 25: 34147, 26: 34147, 27: 34147}, 'label': {0: 18, 1: 0, 2: 0, 3: 0, 4: 0, 5: 0, 6: 1, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 2, 13: 0, 14: 0, 15: 0, 16: 0, 17: 0, 18: 0, 19: 0, 20: 0, 21: 19, 22: 19, 23: 19, 24: 19, 25: 19, 26: 19, 27: 19}, 'value': {0: '            ', 1: '            ', 2: '100         ', 3: '            ', 4: '53.68547236 ', 5: '            ', 6: '            ', 7: '            ', 8: '            ', 9: '            ', 10: '656.06155202', 11: '            ', 12: '            ', 13: '            ', 14: '150         ', 15: '50          ', 16: '24.610985335', 17: '            ', 18: '223.81789584', 19: '148.1805218 ', 20: '110.9712538 ', 21: '73.62651909 ', 22: '            ', 23: '53.35958016 ', 24: '            ', 25: '            ', 26: '            ', 27: '393.54029411'}}    
    
    df = pd.DataFrame.from_dict(df_dict)
    
    # weight each row by 1/(n_labels * number of rows sharing its label)
    n_labels = df.label.nunique()
    n_rows = df.groupby("label").id.transform("count")
    weights = 1/(n_rows*n_labels)
    
    # sanity check: the weights should sum to 1
    # (df.sample would normalize them anyway)
    assert np.isclose(weights.sum(), 1.0)
    
    df_samples = df.sample(n=40000, weights=weights, replace=True, random_state=19)
    

    Verify that the label draws are approximately uniform:

    print(df_samples.label.value_counts()/len(df_samples))
    
    # sampling frequency by group:
    # 0     0.203325
    # 2     0.201075
    # 18    0.200925
    # 19    0.198850
    # 1     0.195825
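
  • To oversample so that every label ends up with exactly the same number of rows:

    Instead of weighting a single draw, you can resample each label group up to the size of the largest one. This is a minimal sketch, assuming pandas >= 1.1 (which added DataFrameGroupBy.sample); the target count and random_state here are illustrative choices:

    # upsample every label to the majority-label count;
    # note the majority label is itself redrawn with replacement
    target = df.label.value_counts().max()
    df_balanced = (
        df.groupby("label")
          .sample(n=target, replace=True, random_state=19)
          .reset_index(drop=True)
    )
    
    # every label now appears exactly `target` times
    print(df_balanced.label.value_counts())

    Because this duplicates whole rows, it works regardless of missing values in `value`, which is what blocks SMOTE-style interpolation on this data.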