r dataframe oversampling

Oversampling method using R


I'm studying oversampling methods in R. Let's say I want to oversample the data frame df:

df <- data.frame(y=rep(as.factor(c('Yes', 'No')), times=c(90, 10)),
                 x1=rnorm(100),
                 x2=rnorm(100))

df has 10 No's and 90 Yes's, so y is imbalanced. I tried to use the ubBalance function to balance y, but it seems I cannot use it because I am on R version 4. Is there an easy way to do oversampling in R version 4?


Solution

  • You could use Random Walk Oversampling via the rwo function from the imbalance package. Its documentation describes it as follows:

    Generates synthetic minority examples for a dataset trying to preserve the variance and mean of the minority class. Works on every type of dataset.

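    Here is a reproducible example that generates 50 synthetic No rows (with 10 original No rows and 90 Yes rows, numInstances = 80 would be needed to match the class counts exactly):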

    df <- data.frame(y=rep(as.factor(c('Yes', 'No')), times=c(90, 10)),
                     x1=rnorm(100),
                     x2=rnorm(100))
    
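    library(imbalance)
    # rwo() looks for the class column named by classAttr, which defaults to "Class"
    colnames(df) <- c("Class", "x1", "x2")
    # Generate 50 synthetic minority ("No") rows; rwo() returns only the new rows
    new_df <- rwo(df, numInstances = 50)
    new_df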
    #>    Class          x1          x2
    #> 1     No -1.16439984  0.21856395
    #> 2     No  1.20744623  0.28858048
    #> 3     No  1.56528275 -0.07579441
    #> 4     No -1.03733411  0.01835535
    #> 5     No -0.70526984 -2.01477788
    #> 6     No -0.80978490  0.64829995
    #> 7     No  0.32493643 -0.05699719
    #> 8     No -0.98764951 -1.72838623
    #> 9     No -0.42004551  0.79171386
    #> 10    No -2.02128473  0.41171867
    #> 11    No -0.84667118 -1.31055008
    #> 12    No -0.41447116  0.73619119
    #> 13    No -0.59519331 -2.12420980
    #> 14    No -1.87381529  0.36029347
    #> 15    No -1.71772198 -0.67236749
    #> 16    No -1.91984498  0.30281031
    #> 17    No -0.30854811  1.07314736
    #> 18    No -2.09342702 -0.33375116
    #> 19    No -0.57984243  0.94788328
    #> 20    No -1.04299574  0.97960623
    #> 21    No -0.48914322  1.09651605
    #> 22    No  1.95909036  0.62301445
    #> 23    No  0.32071004 -2.08889830
    #> 24    No -0.98998047  0.45250458
    #> 25    No  0.78258023 -0.57429362
    #> 26    No  0.04426842 -1.48160646
    #> 27    No -1.61386524 -0.07911380
    #> 28    No -0.54491597  0.24783255
    #> 29    No -1.55084192  0.44819029
    #> 30    No  0.40391743 -2.00554911
    #> 31    No -0.57996600 -1.70075786
    #> 32    No  0.34502429 -0.11452995
    #> 33    No -1.42240697 -0.15749236
    #> 34    No  0.56406328 -1.96536380
    #> 35    No -0.99870646  0.16643333
    #> 36    No  0.29262027 -1.86874500
    #> 37    No  1.44551833  0.35333586
    #> 38    No  1.69167557  0.16451481
    #> 39    No -0.63712453 -2.37375325
    #> 40    No -1.13339974  0.25853248
    #> 41    No  1.60384482  0.21507984
    #> 42    No -0.76946285  0.27068821
    #> 43    No  0.58484861 -2.48727381
    #> 44    No -1.33939478 -0.11824381
    #> 45    No -1.01812834 -1.85177192
    #> 46    No  0.57773883 -0.29486029
    #> 47    No -1.11804972 -1.39796677
    #> 48    No -1.79134432 -0.07027661
    #> 49    No -0.56362892 -1.66805640
    #> 50    No -1.61152940  0.06337827
    plotComparison(df, rbind(df, new_df), attrs = names(new_df)[1:3])
    

    Created on 2022-07-10 by the reprex package (v2.0.1)
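
If installing a package is not an option, plain random oversampling can be sketched in base R alone. This is not the same method as rwo: it duplicates existing minority rows instead of synthesizing new ones. The sketch below assumes the class column is still named y, as in the question; the names counts, minority, n_extra, and balanced_df are only illustrative.

```r
# Minimal sketch of random oversampling in base R (no extra packages).
# Assumption: the class column is named "y", as in the question's df.
set.seed(42)  # fixed seed so the resampling is reproducible

df <- data.frame(y = rep(as.factor(c("Yes", "No")), times = c(90, 10)),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

counts   <- table(df$y)                        # class frequencies
minority <- names(counts)[which.min(counts)]   # label of the rarer class
n_extra  <- max(counts) - min(counts)          # rows needed to match the majority

# Pick minority-class rows with replacement. sample.int() on the index
# vector avoids the surprise of sample(x) when x has length 1.
minority_rows <- which(df$y == minority)
picked <- sample.int(length(minority_rows), size = n_extra, replace = TRUE)
extra_rows <- minority_rows[picked]

# Append the duplicated minority rows to the original data
balanced_df <- rbind(df, df[extra_rows, ])
rownames(balanced_df) <- NULL

table(balanced_df$y)  # both classes now have 90 rows
```

Because the added rows are exact copies, a model can overfit to them more easily than to synthetic points, so it is usually safer to oversample only the training split and leave the test data untouched.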