rdplyrnestedsampleoversampling

Oversample within in group


I would like to oversample such that I have balance on my binary dependent variable within each group in my data set.

So my data looks like this:

library(dplyr)
library(purrr)
library(tidyr)
seed(123)

# example data
(data <- tibble(
  country = c("France", "France", "France", 
              "UK", "UK", "UK", "UK", "UK", "UK"),
  YES = c(0, 0, 1, 
          0, 0, 0, 0, 1, 1),
  X = rnorm(9, 0 ,1)
))

# A tibble: 9 x 3
  country   YES       X
  <chr>   <dbl>   <dbl>
1 France      0 -1.12  
2 France      0 -0.200 
3 France      1  0.781 
4 UK          0  0.100 
5 UK          0  0.0997
6 UK          0 -0.380 
7 UK          0 -0.0160
8 UK          1 -0.0265
9 UK          1  0.860

I am trying to achieve balance on YES within France and the UK by oversampling. In France I would like to have 4 observations and in the UK 8 so that one random sample could look like this):

# A tibble: 12 x 3
  country   YES       X
  <chr>   <dbl>   <dbl>
1 France      0 -1.12  
2 France      0 -0.200 
3 France      1  0.781 
3 France      1  0.781 
4 UK          0  0.100 
5 UK          0  0.0997
6 UK          0 -0.380 
7 UK          0 -0.0160
8 UK          1 -0.0265
9 UK          1  0.860
8 UK          1 -0.0265
8 UK          1 -0.0265

My approach was this:

# oversample 1's within each country
(n_data <- data %>%
  group_by(country) %>%
  nest(.key = "original") %>%
  mutate(os = map(original, ~ group_by(., YES))) %>%
  mutate(os = map(os, ~ slice_sample(., replace = TRUE, prop = 1))))

# A tibble: 2 x 3
# Groups:   country [2]
  country original         os              
  <chr>   <list>           <list>          
1 France  <tibble [3 x 2]> <tibble [3 x 2]>
2 UK      <tibble [6 x 2]> <tibble [6 x 2]>
Warning message:
`.key` is deprecated 

So in OS the dimensions should be 4 x 2 and 8 x 2. Does anyone know how to do this?


Solution

  • This seems overcomplicated, but each individual step seems clear and robust:

    data %>% 
      count(country, YES) %>%
      group_by(country) %>%
      ## Figure out how many additional rows are needed
      mutate(
        goal_rows = max(n),
        extra_rows = goal_rows - n
      ) %>%
      select(country, YES, extra_rows) %>%
      ## Keep only the country/YES combinations that need extra rows
      filter(extra_rows > 0) %>%
      ## Join back to original data
      left_join(data, by = c("country", "YES")) %>%
      group_by(country) %>%
      ## Randomly keep the appropriate number of rows
      mutate(rand = rank(runif(n()))) %>%
      filter(rand <= extra_rows) %>%
      select(-extra_rows, -rand) %>%
      ## Combine oversampled rows with original data
      bind_rows(data) %>%
      arrange(country, YES)
    # # A tibble: 12 x 3
    # # Groups:   country [2]
    #    country   YES       X
    #    <chr>   <dbl>   <dbl>
    #  1 France      0  1.88  
    #  2 France      0 -0.0793
    #  3 France      1  0.812 
    #  4 France      1  0.812 
    #  5 UK          0 -1.66  
    #  6 UK          0 -0.797 
    #  7 UK          0  0.639 
    #  8 UK          0 -0.141 
    #  9 UK          1 -0.207 
    # 10 UK          1  1.30  
    # 11 UK          1 -0.207 
    # 12 UK          1  1.30