
How to repeat downSample in R?


I'm not sure what approach to take to this problem (I'm new to both R and statistical analysis). I have a highly imbalanced class in my data set:


  PCL_Sum     n
*     <dbl> <int>
1         0   300
2         1    25

I realise that I could use downSample (from the caret package) on this data to get a balanced set with 25 randomly selected 0s and my existing 25 1s. However, I would like to repeat this process 12 times so that all of my '0' data is used, leaving me with 12 sets of data.

I realise that I could do this 12 times by hand, but I'd like to automate the process. Could someone give me a general idea of how they would approach the problem? I realise that there is likely an answer out there but I'm having trouble understanding the documentation I've found. Thank you!


Solution

  • Is there something undesirable about downSample? It seems like you could just apply it 12 times and go from there for your samples. Here's an example.

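    library(caret)   # provides downSample() and the oil data set
    data(oil)        # loads fattyAcids (predictors) and oilType (class labels)
    table(oilType)   # class counts before downsampling
    downSample(fattyAcids, oilType)   # one balanced sample
    # Repeat the downsampling 12 times; each call draws a new random sample
    mysamples <- lapply(1:12, function(x) downSample(fattyAcids, oilType))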
    

    You can then access the first set with mysamples[[1]], the second with mysamples[[2]], and so on.

    > mysamples[[1]]
       Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic Class
    1      11.5     5.1  27.8     54.5       0.2        0.4        0.1     A
    2      11.4     5.8  34.5     48.3       1.0        0.1        0.1     A
    3       6.1     4.1  24.0     64.3       0.1        0.3        0.1     B
    4       6.1     4.1  26.7     61.0       0.6        0.3        0.2     B
    5       9.7     3.4  59.3     20.5       0.1        1.5        1.2     C
    6       9.6     3.3  57.7     20.7       0.2        1.5        1.8     C
    7       9.3     2.8  65.0     17.0       3.9        0.5        0.7     D
    8      10.9     2.7  76.7      7.9       0.8        0.1        0.1     D
    9      10.9     3.6  26.0     52.6       5.5        0.4        0.2     E
    10     10.5     4.2  24.4     52.1       7.5        0.4        0.1     E
    11      5.4     2.0  53.2     28.9       7.3        0.6        1.3     F
    12      5.1     2.3  55.9     27.4       6.8        0.5        0.5     F
    13     10.0     2.3  36.9     47.1       2.2        0.5        0.5     G
    14     10.7     1.8  30.2     55.5       0.9        0.5        0.3     G
    > mysamples[[2]]
       Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic Class
    1      13.0     6.2  25.8     55.0       0.8        0.1        0.1     A
    2      13.1     5.7  31.7     49.5       0.6        0.1        0.1     A
    3       5.6     4.2  25.7     58.9       1.7        2.8        0.9     B
    4       6.1     4.1  24.0     64.3       0.1        0.3        0.1     B
    5       9.6     3.3  57.7     20.7       0.2        1.5        1.8     C
    6      10.0     3.3  60.0     21.3       0.2        1.5        1.3     C
    7       9.3     2.8  65.0     17.0       3.9        0.5        0.7     D
    8      14.9     2.6  68.2     12.8       0.6        0.4        0.3     D
    9      10.9     3.6  26.0     52.6       5.5        0.4        0.2     E
    10      9.7     3.9  25.1     54.2       5.9        0.1        0.1     E
    11      5.1     2.3  55.9     27.4       6.8        0.5        0.5     F
    12      5.5     1.7  59.0     21.3       9.3        0.6        1.5     F
    13     10.7     1.8  30.2     55.5       0.9        0.5        0.3     G
    14     10.0     2.3  36.9     47.1       2.2        0.5        0.5     G
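
Note that the two sets printed above share several rows. Repeated independent draws do not guarantee that every majority-class row is used. The following is a minimal base R sketch of that effect for the question's numbers (300 majority rows, 25 per draw, 12 draws). It assumes each downSample call draws independently of the others, and the names `majority_rows`, `draws` and `used` are made up for illustration.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

majority_rows <- 1:300  # row indices of the hypothetical majority class

# Twelve independent draws of 25 rows each (no replacement within a draw)
draws <- lapply(1:12, function(i) sample(majority_rows, 25))

# Rows that appear in at least one of the twelve draws
used <- unique(unlist(draws))
length(used)  # number of distinct majority rows actually used
```

A given row is missed by all twelve draws with probability (1 - 25/300)^12, roughly 0.35, so on average only about two thirds of the majority rows end up in any sample.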
    

    Edit for unique samples:

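    # Toy data: 25 minority rows ("A") followed by 300 majority rows ("B")
    df <- data.frame(class = c(rep("A", 25), rep("B", 300)),
                     value = 1:325)
    # Sample x = all 25 "A" rows plus the x-th block of 25 "B" rows
    mysamples <- lapply(1:12, function(x) df[c(1:25, (x * 25 + 1):((x + 1) * 25)), ])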
    

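    This takes the first 25 rows of the majority class for sample 1, the next 25 for sample 2, and so on up to the 12th sample. Every sample also contains all 25 minority-class rows. Note that the indexing assumes the 25 minority rows come first in the data frame, and that the majority blocks are taken in row order rather than at random.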
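
If the majority rows should be assigned to the 12 sets at random rather than in row order, one possible variation is to shuffle them first and then split them into blocks. This is a sketch in base R on the same toy data, not something the answer above prescribes. The helper names (`minority`, `majority`, `shuffled`, `chunk_id`, `chunks`) are made up for illustration, and it assumes the majority count is an exact multiple of the minority count, as in the question (300 = 12 x 25).

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Same toy data as above: 25 minority rows ("A"), 300 majority rows ("B")
df <- data.frame(class = c(rep("A", 25), rep("B", 300)),
                 value = 1:325)

minority <- df[df$class == "A", ]
majority <- df[df$class == "B", ]

# Shuffle the majority rows, then label them in blocks the size of the minority
shuffled <- majority[sample(nrow(majority)), ]
chunk_id <- ceiling(seq_len(nrow(shuffled)) / nrow(minority))

# One data frame per block, each combined with all of the minority rows
chunks <- split(shuffled, chunk_id)
mysamples <- lapply(chunks, function(chunk) rbind(minority, chunk))

length(mysamples)            # number of balanced sets
table(mysamples[[1]]$class)  # class counts in the first set
```

Selecting rows by class instead of by position also removes the requirement that the minority rows come first in the data frame.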