SMOTE in R reducing sample size significantly


I have a data set with around 130,000 records. The records are divided into two classes of the target variable, 0 and 1. Class 1 makes up only 0.09% of the records.

I'm running my analysis in R 3.5.1 on Windows 10. I used the SMOTE algorithm to work with this imbalanced data set.

I used the following code to handle the imbalanced data set:

library(DMwR)
# Convert the target to a factor, as SMOTE works with the factor data type
data_code$target <- as.factor(data_code$target)
smoted_data <- SMOTE(target ~ ., data_code, perc.over = 100)

But after executing the code, I see that the count for class 0 is 212 and the count for class 1 is also 212, which is a significant reduction of my sample size. Can you suggest how I can handle this imbalanced data set with SMOTE without changing the size of my data?


Solution

  • You need to play a bit with the two parameters available in the function: perc.over and perc.under.

    As per the documentation of SMOTE:

    The parameters perc.over and perc.under control the amount of over-sampling of the minority class and under-sampling of the majority classes, respectively.

    So:

    perc.over will typically be a number above 100. With this type of values, for each case in the original data set belonging to the minority class, perc.over/100 new examples of that class will be created.

    I can't see your data but, if your minority class has 100 cases and perc.over=100, the algorithm will generate 100/100 = 1 new case for each of them, i.e. 100 new cases in total.

    The parameter perc.under controls the proportion of cases of the majority class that will be randomly selected for the final "balanced" data set. This proportion is calculated with respect to the number of newly generated minority class cases.

    So, for example, a value of perc.under=100 will select from the majority class in the original data the same number of observations as were generated for the minority class.

    In our example 100 new cases were generated, so 100 majority cases will be selected, resulting in a new data set with 300 cases: the 100 original minority cases, the 100 synthetic ones and 100 majority cases.
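
    The arithmetic described in the documentation can be sketched in base R. The helper below, smote_sizes, is hypothetical (it is not part of DMwR) and only reproduces the counting rule quoted above, assuming perc.over is at least 100. The figure of 106 minority cases for the question's data is an inference from the reported 212, not a number given in the question.

```r
# Hypothetical helper (not part of DMwR): expected class sizes after SMOTE,
# based on the documented meaning of perc.over and perc.under.
# Assumption: perc.over >= 100.
smote_sizes <- function(n_minority, perc.over, perc.under) {
  stopifnot(perc.over >= 100)
  n_new      <- n_minority * floor(perc.over / 100)  # synthetic minority cases
  n_majority <- floor(perc.under / 100 * n_new)      # majority cases selected
  c(minority = n_minority + n_new,
    majority = n_majority,
    total    = n_minority + n_new + n_majority)
}

# 100 minority cases, perc.over = 100, perc.under = 100
smote_sizes(100, perc.over = 100, perc.under = 100)
# minority = 200, majority = 100, total = 300

# Assumed 106 minority cases (inferred), perc.over = 100, perc.under = 200
smote_sizes(106, perc.over = 100, perc.under = 200)
# minority = 212, majority = 212, total = 424
```

    Under that assumption, the 212/212 split in the question is what you would expect when perc.over=100 is combined with a perc.under of 200: the majority class is cut down to twice the number of synthetic cases.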

    I suggest using values above 100 for perc.over, and an even higher value for perc.under (both default to 200).

    Keep in mind that you're adding synthetic observations, which are not real, to your minority class, so I'd try to keep their number under control.

    Numeric example:

    set.seed(123)
    
    data <- data.frame(var1 = sample(50),
                       var2 = sample(50),
                       out = as.factor(rbinom(50, 1, prob=0.1)))
    
    table(data$out)
    #  0  1 
    # 43  7 # 50 rows total (original data)
    smote_data <- DMwR::SMOTE(out ~ var1, data, perc.over = 200, perc.under = 400)
    table(smote_data$out)
    #  0  1 
    # 56 21 # 77 rows total (smote data)
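
    To go the other way and pick perc.under for a desired number of majority rows, the same counting rule can be inverted. This is only a sketch: perc_under_for is a hypothetical helper (not part of DMwR), it assumes perc.over is at least 100, and the result may be off by one row because of rounding.

```r
# Hypothetical helper (not part of DMwR): the perc.under value that would
# select about `target_majority` majority rows.
# Assumption: perc.over >= 100.
perc_under_for <- function(target_majority, n_minority, perc.over) {
  stopifnot(perc.over >= 100)
  n_new <- n_minority * floor(perc.over / 100)  # synthetic cases created
  100 * target_majority / n_new
}

# Numbers from the example above: 7 minority cases, perc.over = 200
perc_under_for(56, n_minority = 7, perc.over = 200)  # 400, the value used above
perc_under_for(43, n_minority = 7, perc.over = 200)  # about 307, for roughly 43 rows
```

    Note that the example above ends up with 56 majority rows although the original data has only 43, so the majority rows are evidently drawn with replacement. Matching the original count therefore does not guarantee that every original majority row is kept.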