r tidyverse nest sampling

Sample a predefined number of observations per group


My data looks like this:

> data|>head(20)|>dput()
structure(list(id = c("42190204", "34390202", "34310104", "34310104", 
"34310104", "34310104", "34310104", "34310104", "34310104", "34310104", 
"34310104", "34310104", "34310104", "34310104", "34310104", "34310104", 
"34310104", "34310104", "34310104", "34310104"), sample_size = c(0, 
7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c(NA, 
20L), class = "data.frame")

My data also has NAs in sample_size, which I filter out. Some ids have 800+ observations, so for some groups sample_size is larger than the number of available observations and for others it is smaller. In the first case, it would be ideal to repeat the existing observations to make up sample_size; in the second case, I just want to select observations at random.

From each group (defined by the id column), I need to randomly select a number of observations equal to that group's sample_size.

I am looking at this answer and trying it with nest(), but it does not work: the code finishes without error, but the number of observations in each group does not match sample_size. How can I fix this?

My code:

test<-data|>filter(!is.na(sample_size))|>nest(id)|>
  mutate(
    data_sample = map2(data, sample_size, ~ slice_sample(.x, n = .y))
  ) %>%
  unnest(cols = data_sample)

Solution

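  • It works if you don't nest the id column. nest(id) packs id itself into the list-column, so the rows are no longer grouped by id. Instead, nest everything except id and sample_size (the example below adds a value column to sample from):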

    library(tidyverse)
    
    set.seed(22)
    
    n_samples <- tibble(
      id = 1:100,
      sample_size = sample(0:20, 100, replace = TRUE)
    )
    
    tibble(id = rep(1:100, each = 10), 
           value = runif(1000, 10, 100)) |> 
      left_join(n_samples, by = join_by(id)) |> 
      nest(data = -c(id, sample_size)) |> 
      mutate(data_sample = map2(data,
                                sample_size, 
                                ~ slice_sample(.x, n = .y, replace = TRUE))) |> 
      unnest(data_sample) |> 
      select(-data)
    #> # A tibble: 1,097 × 3
    #>       id sample_size value
    #>    <int>       <int> <dbl>
    #>  1     1           5  63.9
    #>  2     1           5  28.5
    #>  3     1           5  83.4
    #>  4     1           5  83.4
    #>  5     1           5  98.5
    #>  6     2           8  68.2
    #>  7     2           8  39.8
    #>  8     2           8  68.2
    #>  9     2           8  39.8
    #> 10     2           8  68.2
    #> # ℹ 1,087 more rows
    

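    This returns sample_size rows for each id. replace = TRUE allows resampling when the requested size is larger than the number of rows in the group. Note that it samples with replacement for every group, so rows can repeat even when the group has enough rows (as appears to happen for id 1 above, where 83.4 shows up twice).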
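
If repeats should only occur when a group is too small, one option is to turn on replacement per group. The following is a minimal sketch of that idea, not part of the original answer: the toy tables obs and sizes and the names sampled and check are made up for illustration, and it assumes R >= 4.1 for the \(d, size) lambda syntax.

```r
library(dplyr)
library(tidyr)
library(purrr)

set.seed(22)

# Hypothetical toy data: three ids with 3, 10 and 4 rows
obs <- tibble(
  id    = rep(c("a", "b", "c"), times = c(3, 10, 4)),
  value = seq_len(17)
)

# Hypothetical requested sample size per id
sizes <- tibble(id = c("a", "b", "c"), sample_size = c(5, 4, 0))

sampled <- obs |>
  left_join(sizes, by = "id") |>
  filter(!is.na(sample_size)) |>
  # Keep id and sample_size outside the list-column
  nest(data = -c(id, sample_size)) |>
  mutate(data_sample = map2(
    data, sample_size,
    # Sample with replacement only when the group has too few rows
    \(d, size) slice_sample(d, n = size, replace = size > nrow(d))
  )) |>
  select(-data) |>
  unnest(data_sample)

# Compare the number of sampled rows with the requested size per id
check <- count(sampled, id, sample_size)
check
```

Here id "a" has only 3 rows, so its 5 sampled rows necessarily contain repeats, while id "b" gets 4 distinct rows. Groups with sample_size 0 (id "c") produce empty samples and drop out of the result after unnest().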