Tags: r, repeat, imbalanced-data, sample-size

Cross multiplication to equalize sample proportions


I have a large dataset; below is a subset of it. Category is the dependent variable, and Day_1 and Day_2 are the independent variables.

ID <- c("e-1", "e-2", "e-3", "e-8", "e-9", "e-10", "e-13", "e-16", "e-17", "e-20")
Day_1 <- c(0.58, 0.62, 0.78, 0.18, 0.98, 0.64, 0.32, 0.54, 0.94, 0.87)
Day_2 <- c(0.58, 0.65, 0.25, 0.34, 0.17, 0.82, 0.67, 0.39, 0.49, 0.86)
Category <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)

df <- data.frame(ID, Day_1, Day_2, Category)

The sample sizes of the two categories differ (3 rows in Category 0 and 7 rows in Category 1), so I want to perform a cross multiplication: repeat every Category 0 row 7 times and every Category 1 row 3 times, so that both categories end up with 7 * 3 = 21 rows. The final data frame should contain the same columns as 'df', plus all the added rows.

How am I supposed to do this in R?


Solution

  • This might be the wrong approach: repeating rows increases the overall sample size without adding any information, which shrinks the standard errors and inflates the test statistics.

    See this small example, which also has a binary dependent variable. Doubling the sample size (without changing the proportions of "am") leaves the estimate unchanged but gives a smaller standard error, a larger z value, and a smaller p-value.

    summary(glm(am ~ mpg, mtcars, family='binomial'))
    #             Estimate Std. Error z value Pr(>|z|)   
    # mpg           0.3070     0.1148   2.673  0.00751 **
      
    summary(glm(am ~ mpg, rbind(mtcars, mtcars), family='binomial'))
    #             Estimate Std. Error z value Pr(>|z|)   
    # mpg          0.30703    0.08121   3.781 0.000156 ***
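
    The size of that drop is roughly what stacking two copies of the data predicts: every row counts twice, so the estimate stays put while the standard error shrinks by a factor of about sqrt(2). A minimal check, assuming only base R and the built-in mtcars data (the names fit1, fit2, se1 and se2 are just for illustration):

    ```r
    # Fit the same logistic regression on the original and on the duplicated data
    fit1 <- glm(am ~ mpg, data = mtcars, family = binomial)
    fit2 <- glm(am ~ mpg, data = rbind(mtcars, mtcars), family = binomial)

    # Standard errors of the mpg coefficient in each fit
    se1 <- coef(summary(fit1))["mpg", "Std. Error"]
    se2 <- coef(summary(fit2))["mpg", "Std. Error"]

    # The ratio should be close to sqrt(2): twice the rows, no new information
    se1 / se2
    sqrt(2)
    ```

    With the rounded values printed above, 0.1148 / 0.08121 is about 1.41.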
    

    What you want instead are frequency weights, which you derive by dividing the population proportions (both .5 in your case) by the sample proportions. You can use mapply for that.

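    # w = target proportion / sample proportion of "am",
    # looked up for each row by its own value of "am"
    mtcars <- transform(mtcars, 
                        w=mapply(`/`, 
                                 c(`0`=.5, `1`=.5), 
                                 proportions(table(am)))[as.character(am)])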
    
    summary(glm(am ~ mpg, mtcars, weights=w, family='binomial'))
    #             Estimate Std. Error z value Pr(>|z|)   
    # mpg           0.3005     0.1123   2.676  0.00746 **
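
    Applied to the data frame from the question, the same idea might look like the sketch below. It assumes the 50/50 target proportions used above; the names tab, n, p_sample, p_target, w_by_cat, times and df_rep are illustrative and not part of the original post. The literal row repetition from the question is included at the end only for comparison.

    ```r
    # Data from the question
    ID <- c("e-1", "e-2", "e-3", "e-8", "e-9", "e-10", "e-13", "e-16", "e-17", "e-20")
    Day_1 <- c(0.58, 0.62, 0.78, 0.18, 0.98, 0.64, 0.32, 0.54, 0.94, 0.87)
    Day_2 <- c(0.58, 0.65, 0.25, 0.34, 0.17, 0.82, 0.67, 0.39, 0.49, 0.86)
    Category <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)
    df <- data.frame(ID, Day_1, Day_2, Category)

    # Counts per category as a plain named vector: "0" = 3, "1" = 7
    tab <- table(df$Category)
    n <- setNames(as.vector(tab), names(tab))

    # Sample proportions (0.3 and 0.7) and assumed target proportions (50/50)
    p_sample <- n / sum(n)
    p_target <- c(`0` = 0.5, `1` = 0.5)

    # Weight per row = target proportion / sample proportion of its category
    w_by_cat <- p_target / p_sample[names(p_target)]
    df$w <- as.numeric(w_by_cat[as.character(df$Category)])

    # Both categories now carry the same total weight, and the weights sum to nrow(df)
    tapply(df$w, df$Category, sum)
    sum(df$w)

    # For comparison only: the literal repetition described in the question
    # (each Category 0 row 7 times, each Category 1 row 3 times)
    times <- ifelse(df$Category == 0, n[["1"]], n[["0"]])
    df_rep <- df[rep(seq_len(nrow(df)), times = times), ]
    table(df_rep$Category)
    ```

    The w column can then be passed to glm() through weights=w, as in the mtcars example. With fractional weights and family='binomial', R typically warns about "non-integer #successes"; that is expected here.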