I'm new to R, so please bear with me.
I'm trying to perform stratified sampling using two columns as strata, where both columns must match specific values.
This is my code:
library(splitstackshape)
set.seed(1)
dat1 <- data.frame(ID = 1:100,
A = sample(c("AA", "BB", "CC", "DD", "EE"), 100, replace = TRUE),
B = sample(c(30,40,50),100,replace = TRUE), C = sample(c(1:10),100,replace = TRUE),
D = sample(c("CA", "NY", "TX"), 100, replace = TRUE),
E = sample(c("M", "F"), 100, replace = TRUE))
stratified(dat1, c("B", "C"), 0.1, select = list(B = 30, C = c(8:10)))
As I understand it, this function first generates a stratified sample of 10% and then selects the records that satisfy the condition B = 30 and C between 8 and 10.
As a result, the sample ends up smaller than the initial 10%.
My question: is there a way to generate a sample consisting of records in which column B is 30 and column C is between 8 and 10, with the nrow() of the resulting sample being 10% of the original data frame?
I'm using stratified() from "splitstackshape". If stratified() cannot handle this, are there other packages that can perform this kind of operation?
Continuing from the sample data in the original answer, I would use a two-step process:
Create a subset with the levels you're interested in.
sub1 <- as.data.table(dat1)[B == 30 & C %in% 8:10][order(C)]
Figure out what percentage you need to sample. Here, I've set the final number of rows to 500, since the subset of the sample data has fewer than 1000 rows (10% of the original 10,000). The required percentage is simply the desired number of rows divided by the total number of rows in the subset:
rows_wanted <- 500
set.seed(2)
out <- stratified(sub1, "C", rows_wanted/nrow(sub1))
## Check how many rows we have per group
out[, .N, .(B, C)]
# B C N
# 1: 30 8 157
# 2: 30 9 169
# 3: 30 10 174
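For readers who want the same two steps without `stratified()`, here is a minimal base R sketch. It is not the package's implementation: the helper name `sample_subset` and its arguments are hypothetical, and the data are simulated here, so the per-group counts will not match the output above. It also caps the requested size at the subset size, since sampling is without replacement.

```r
# Hypothetical helper: filter first, then sample every stratum at a common rate.
sample_subset <- function(df, keep, strata, rows_wanted) {
  sub <- df[keep, , drop = FALSE]
  # Without replacement, we cannot draw more rows than the subset holds.
  rows_wanted <- min(rows_wanted, nrow(sub))
  frac <- rows_wanted / nrow(sub)
  # Row positions of the subset, grouped by the stratum column.
  groups <- split(seq_len(nrow(sub)), sub[[strata]], drop = TRUE)
  picked <- lapply(groups, function(idx) {
    # sample.int() on the length avoids sample()'s single-element pitfall.
    idx[sample.int(length(idx), round(frac * length(idx)))]
  })
  sub[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

# Simulated data (an assumption; only the columns needed for the example).
set.seed(1)
n <- 10000
dat <- data.frame(ID = seq_len(n),
                  B = sample(c(30, 40, 50), n, replace = TRUE),
                  C = sample(1:10, n, replace = TRUE))

keep <- dat$B == 30 & dat$C %in% 8:10
out <- sample_subset(dat, keep, "C", rows_wanted = 500)
table(out$C)  # rows per stratum
```

Because each stratum's size is rounded separately, the total can differ from `rows_wanted` by a row. To aim for 10% of the original data, pass `rows_wanted = round(0.1 * nrow(dat))`; the cap applies if the subset is smaller than that.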
The stratified function filters the data first, and then does the sampling. Consider the following:
library(splitstackshape)
library(data.table)  # as.data.table() below needs data.table attached
set.seed(1)
n <- 10000
dat1 <- data.frame(ID = sequence(n),
A = sample(c("AA", "BB", "CC", "DD", "EE"), n, replace = TRUE),
B = sample(c(30,40,50),n,replace = TRUE),
C = sample(c(1:10),n,replace = TRUE),
D = sample(c("CA", "NY", "TX"), n, replace = TRUE),
E = sample(c("M", "F"), n, replace = TRUE))
Sample, as you've shown.
mySample <- stratified(dat1, c("B", "C"), 0.1, select = list(B = 30, C = 8:10))
nrow(mySample)
# [1] 98
Compare that to how many rows you should expect in the output:
as.data.table(dat1)[, .N, .(B, C)][B == 30 & C %in% 8:10, list(N = round(N * .1)), .(B, C)][order(C)]
# B C N
# 1: 30 8 31
# 2: 30 9 33
# 3: 30 10 34
And compare the above to what you get from the stratified function.
mySample[, .N, .(B, C)]
# B C N
# 1: 30 8 31
# 2: 30 9 33
# 3: 30 10 34
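The arithmetic behind this comparison can be sketched in base R. This is only an illustration on freshly simulated data (an assumption, so the counts will differ from those shown above): sampling 10% of each filtered stratum gives roughly 10% of the subset, which is far fewer rows than 10% of the full data frame.

```r
# Simulated data with only the two strata columns (an assumption).
set.seed(1)
n <- 10000
dat <- data.frame(B = sample(c(30, 40, 50), n, replace = TRUE),
                  C = sample(1:10, n, replace = TRUE))

# Filter first, as select = list(B = 30, C = 8:10) does.
sub <- dat[dat$B == 30 & dat$C %in% 8:10, ]

# Rows expected per stratum when 10% of each filtered group is drawn.
expected <- round(table(sub$C) * 0.1)

sum(expected)           # total rows in a 10% sample of the subset
round(0.1 * nrow(dat))  # 10% of the full data frame, for comparison

# Fraction of the subset needed to reach 10% of the full data.
# A value above 1 means the subset is too small to supply that many rows.
needed_frac <- 0.1 * nrow(dat) / nrow(sub)
needed_frac
```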