I'm not sure what approach to take to this problem (I'm new to both R and statistical analysis). I have a highly imbalanced class in my data set:
  PCL_Sum     n
*   <dbl> <int>
1       0   300
2       1    25
I realise that I could use downSample on this data to get a balanced set with 25 randomly selected 0s and my existing 25 1s. However, I would like to repeat this process 12 times so that all of my '0' data is used, leaving me with 12 balanced data sets.
I realise that I could do this 12 times by hand, but I'd like to automate the process. Could someone give me a general idea of how they would approach the problem? I realise that there is likely an answer out there but I'm having trouble understanding the documentation I've found. Thank you!
Is there something undesirable about downSample? It seems like you could just apply it 12 times and go from there for your samples. Here's an example.
library(caret)  # provides downSample() and the oil data set
data(oil)       # loads fattyAcids and oilType
table(oilType)
downSample(fattyAcids, oilType)
mysamples <- lapply(1:12, function(x) downSample(fattyAcids, oilType))
Then you can call mysamples[[1]] for the first set, and so on.
> mysamples[[1]]
Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic Class
1 11.5 5.1 27.8 54.5 0.2 0.4 0.1 A
2 11.4 5.8 34.5 48.3 1.0 0.1 0.1 A
3 6.1 4.1 24.0 64.3 0.1 0.3 0.1 B
4 6.1 4.1 26.7 61.0 0.6 0.3 0.2 B
5 9.7 3.4 59.3 20.5 0.1 1.5 1.2 C
6 9.6 3.3 57.7 20.7 0.2 1.5 1.8 C
7 9.3 2.8 65.0 17.0 3.9 0.5 0.7 D
8 10.9 2.7 76.7 7.9 0.8 0.1 0.1 D
9 10.9 3.6 26.0 52.6 5.5 0.4 0.2 E
10 10.5 4.2 24.4 52.1 7.5 0.4 0.1 E
11 5.4 2.0 53.2 28.9 7.3 0.6 1.3 F
12 5.1 2.3 55.9 27.4 6.8 0.5 0.5 F
13 10.0 2.3 36.9 47.1 2.2 0.5 0.5 G
14 10.7 1.8 30.2 55.5 0.9 0.5 0.3 G
> mysamples[[2]]
Palmitic Stearic Oleic Linoleic Linolenic Eicosanoic Eicosenoic Class
1 13.0 6.2 25.8 55.0 0.8 0.1 0.1 A
2 13.1 5.7 31.7 49.5 0.6 0.1 0.1 A
3 5.6 4.2 25.7 58.9 1.7 2.8 0.9 B
4 6.1 4.1 24.0 64.3 0.1 0.3 0.1 B
5 9.6 3.3 57.7 20.7 0.2 1.5 1.8 C
6 10.0 3.3 60.0 21.3 0.2 1.5 1.3 C
7 9.3 2.8 65.0 17.0 3.9 0.5 0.7 D
8 14.9 2.6 68.2 12.8 0.6 0.4 0.3 D
9 10.9 3.6 26.0 52.6 5.5 0.4 0.2 E
10 9.7 3.9 25.1 54.2 5.9 0.1 0.1 E
11 5.1 2.3 55.9 27.4 6.8 0.5 0.5 F
12 5.5 1.7 59.0 21.3 9.3 0.6 1.5 F
13 10.7 1.8 30.2 55.5 0.9 0.5 0.3 G
14 10.0 2.3 36.9 47.1 2.2 0.5 0.5 G
Edit for unique samples (each call to downSample draws independently, so the sets above can share rows):
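# Rows 1-25 are the minority class "A"; rows 26-325 are the majority class "B".
df <- data.frame(class = c(rep("A", 25), rep("B", 300)),
                 value = 1:325)
# Sample x keeps all 25 "A" rows plus the x-th block of 25 "B" rows.
mysamples <- lapply(1:12, function(x) df[c(1:25, (x * 25 + 1):((x + 1) * 25)), ])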
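This takes the first 25 rows of the majority class for sample 1, the next 25 for sample 2, and so on up to the 12th sample, so every majority row is used exactly once. Note that the blocks are taken in row order rather than at random.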
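If the rows are ordered in some meaningful way, you may prefer to shuffle the majority class before cutting it into blocks. Here is a minimal base-R sketch of that idea. It is not something downSample does for you, and the data frame, the column names (borrowed from the question), the seed and the object names are made up for illustration. It assumes you are happy to drop any leftover majority rows when the counts don't divide evenly.

```r
# Hypothetical data mirroring the question: 300 rows of class 0, 25 of class 1.
set.seed(42)  # arbitrary seed, only so the shuffle is reproducible
df <- data.frame(PCL_Sum = c(rep(0, 300), rep(1, 25)),
                 value   = 1:325)

minority_idx <- which(df$PCL_Sum == 1)
majority_idx <- which(df$PCL_Sum == 0)

n_min  <- length(minority_idx)
n_sets <- length(majority_idx) %/% n_min  # number of full-size blocks available

# Shuffle the majority rows and keep only as many as fill whole blocks.
shuffled <- sample(majority_idx)[seq_len(n_sets * n_min)]

# Cut the shuffled rows into n_sets disjoint blocks of n_min rows each.
chunks <- split(shuffled, rep(seq_len(n_sets), each = n_min))

# Each balanced set is all minority rows plus one block of majority rows.
balanced_sets <- lapply(chunks, function(idx) df[c(minority_idx, idx), ])
```

As before, balanced_sets[[1]] is the first set and so on. Each set reuses the same minority rows, and no majority row appears in more than one set.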