r splitstackshape

Stratified sampling with constraints


I'm a newbie in R so just bear with me.

I'm trying to perform stratified sampling that uses two columns as the strata, with both columns restricted to specific values.

This is my code:

library(splitstackshape)
set.seed(1)
dat1 <- data.frame(ID = 1:100,
                   A = sample(c("AA", "BB", "CC", "DD", "EE"), 100, replace = TRUE),
                   B = sample(c(30, 40, 50), 100, replace = TRUE),
                   C = sample(c(1:10), 100, replace = TRUE),
                   D = sample(c("CA", "NY", "TX"), 100, replace = TRUE),
                   E = sample(c("M", "F"), 100, replace = TRUE))

stratified(dat1, c("B", "C"), 0.1, select = list(B = 30, C = c(8:10)))

To my understanding, this function first generates strata of size 10% and then selects from them the records that satisfy the condition B = 30 and C between 8 and 10.

As a result, the sample ends up smaller than the initial 10%.

My question: is there a way to generate a sample consisting only of records where column B is 30 and column C is between 8 and 10, with nrow() of the resulting sample being 10% of the original data frame?

I'm using stratified() from "splitstackshape". If stratified() cannot handle this, are there any other packages out there that can perform this kind of operation?


Solution

  • Update

    Continuing from the sample data in the original answer, I would use a two-step process:

    1. Create a subset with the levels you're interested in.

      sub1 <- as.data.table(dat1)[B == 30 & C %in% 8:10][order(C)]
      
    2. Figure out what proportion you need to sample. Here, I've set the final number of rows to 500, since the subset of the sample data has fewer than 1000 rows. The required proportion is simply the desired number of rows divided by the number of rows in the subset:

      rows_wanted <- 500
      set.seed(2)
      out <- stratified(sub1, "C", rows_wanted/nrow(sub1))
      
      ## Check how many rows we have per group
      out[, .N, .(B, C)]
      #     B  C   N
      # 1: 30  8 157
      # 2: 30  9 169
      # 3: 30 10 174
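
To tie this back to the 10% target in the question, the two steps above could be wrapped in a small helper. This is only a sketch under a few assumptions: `sample_subset()` and its `frac` argument are made-up names (not part of splitstackshape), the filter on `B` and `C` is hard-coded to the values from the question, and the data is a cut-down version of the sample data with only the columns that matter here.

```r
library(splitstackshape)
library(data.table)

# Hypothetical helper, not part of splitstackshape: filter first, then sample
# so the result has roughly `frac` times the number of rows in the full data.
sample_subset <- function(dat, frac = 0.1) {
  sub <- as.data.table(dat)[B == 30 & C %in% 8:10]
  rows_wanted <- round(frac * nrow(dat))
  # If the subset is too small (or empty), all we can do is return it whole.
  if (nrow(sub) <= rows_wanted) return(sub)
  # A size below 1 is treated by stratified() as a proportion per group.
  stratified(sub, "C", rows_wanted / nrow(sub))
}

set.seed(1)
n <- 10000
dat <- data.frame(ID = seq_len(n),
                  B = sample(c(30, 40, 50), n, replace = TRUE),
                  C = sample(1:10, n, replace = TRUE))

set.seed(2)
out <- sample_subset(dat, frac = 0.05)  # 5% of 10000 rows, i.e. about 500
nrow(out)
out[, .N, by = .(B, C)]
```

Since the number of rows drawn appears to be rounded separately within each group, the total can miss the target by a row or so. If the subset has fewer rows than requested, the helper simply returns the whole subset rather than sampling with replacement.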
      

    Original answer

    The stratified function filters the data first, and then does the sampling. Consider the following:

    library(splitstackshape)
    library(data.table)  # as.data.table() is used below
    set.seed(1)
    n <- 10000
    dat1 <- data.frame(ID = sequence(n),
                       A = sample(c("AA", "BB", "CC", "DD", "EE"), n, replace = TRUE),
                       B = sample(c(30,40,50),n,replace = TRUE), 
                       C = sample(c(1:10),n,replace = TRUE),
                       D = sample(c("CA", "NY", "TX"), n, replace = TRUE),
                       E = sample(c("M", "F"), n, replace = TRUE))
    

    Sample, as you've shown.

    mySample <- stratified(dat1, c("B", "C"), 0.1, select = list(B = 30, C = 8:10))
    nrow(mySample)
    # [1] 98
    

    Compare that to how many rows you should expect in the output:

    as.data.table(dat1)[, .N, .(B, C)][B == 30 & C %in% 8:10, list(N = round(N * .1)), .(B, C)][order(C)]
    #     B  C  N
    # 1: 30  8 31
    # 2: 30  9 33
    # 3: 30 10 34
    

    And compare the above to what you get from the stratified function.

    mySample[, .N, .(B, C)]
    #     B  C  N
    # 1: 30  8 31
    # 2: 30  9 33
    # 3: 30 10 34
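
Another way to see the filter-then-sample behaviour is to do the filtering by hand and compare. The sketch below assumes a cut-down version of the sample data (only `ID`, `B` and `C`); the names `via_select`, `via_filter` and `filtered` are just illustrative. It compares only the per-group counts, since the actual rows drawn depend on the random seed.

```r
library(splitstackshape)
library(data.table)

set.seed(1)
n <- 10000
dat <- data.frame(ID = seq_len(n),
                  B = sample(c(30, 40, 50), n, replace = TRUE),
                  C = sample(1:10, n, replace = TRUE))

# Route 1: let stratified() do the filtering through `select`.
via_select <- stratified(dat, c("B", "C"), 0.1,
                         select = list(B = 30, C = 8:10))

# Route 2: filter by hand, then sample 10% within each remaining group.
filtered <- as.data.table(dat)[B == 30 & C %in% 8:10]
via_filter <- stratified(filtered, c("B", "C"), 0.1)

# Compare how many rows each route drew per (B, C) group.
counts_select <- via_select[, .N, by = .(B, C)][order(C)]
counts_filter <- via_filter[, .N, by = .(B, C)][order(C)]
same_counts <- identical(counts_select$N, counts_filter$N)
same_counts
```

If both routes give the same counts per group, the `select` argument is behaving like a filter applied before sampling, which is why the result is 10% of the matching rows rather than 10% of the full data frame.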