r subsampling

Subsampling from a set so that each member is picked at least once, in R


I need code or an idea for the following case: we have a dataset of 1000 rows, and I want to draw subsamples of 800 rows multiple times (I don't know how many times I should repeat). How can I make sure that every row is picked in at least one run? I need the code in R.

To make the question clearer, let's define the row names as:

rownames(dataset) = A,B,C,D,E,F,G,H,J,I

If I subsample 3 times:

A,B,C,D,E,F,G,H
D,E,A,B,H,J,F,C
F,H,E,A,B,C,D,J

Row I is not in any of the subsample sets. I would like to subsample 80 or 90 percent of the data many times, but I expect every row to be chosen in at least one of the subsample sets. In the example above, element I should be picked in at least one of the subsamples.
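For reference, here is a minimal sketch of how the gap above could be detected with plain independent subsampling. The names `row_ids`, `n_runs`, and `never_picked` are made up for this illustration, and the seed is arbitrary:

```r
set.seed(1)  # arbitrary seed, only so the run is reproducible

# Toy row names from the example above (assumed stand-in for rownames(dataset))
row_ids = c("A", "B", "C", "D", "E", "F", "G", "H", "J", "I")
n_runs = 3
subsample_size = 8

# Draw each subsample independently, without replacement within a subsample
subsamples = replicate(n_runs, sample(row_ids, subsample_size), simplify = FALSE)

# Rows that never appear in any of the subsamples (may be empty)
never_picked = setdiff(row_ids, unique(unlist(subsamples)))
never_picked
```

With independent draws like these, nothing prevents `never_picked` from being non-empty, which is exactly the situation I want to rule out.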


Solution

  • One way to do this is to give each row a single guaranteed appearance: randomly assign every row to one subsample in which it is a "forced" pick, and decide that assignment ahead of time. Then fill the rest of each subsample by random sampling without replacement from the remaining rows.

    num_rows = 1000
    num_subsamples = 1000
    subsample_size = 900
    
    full_index = 1:num_rows
    
    dat = data.frame(i = full_index)
    
    # Randomly assign guaranteed subsamples
    # Make sure that we don't accidentally assign more than the subsample size
    # If we're subsampling 90% of the data, it'll take at most a few tries
    biggest_guaranteed_subsample = num_rows
    while (biggest_guaranteed_subsample > subsample_size) {
      # Assign the subsample that the row is guaranteed to appear in
      # size = num_rows so this also works when num_subsamples != num_rows
      dat$guarantee = sample(1:num_subsamples, size = num_rows, replace = TRUE)
      # Find the subsample with the most guaranteed slots taken
      biggest_guaranteed_subsample = max(table(dat$guarantee))
    }
    
    
    # Assign subsamples
    for (ss in 1:num_subsamples) {
      # Pick out any rows guaranteed a slot in that subsample
      my_sub = dat[dat$guarantee == ss, 'i']
      # And randomly select the rest
      my_sub = c(my_sub, sample(full_index[!(full_index %in% my_sub)], 
                                subsample_size - length(my_sub), 
                                replace = FALSE))
      # Do your subsample calculation here
    }
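
The same idea can be wrapped in a function and checked for coverage. This is only a sketch under the assumptions above: `covering_subsamples` is a hypothetical helper name, not an existing function, and the redraw loop is only practical when the subsample size is comfortably larger than `num_rows / num_subsamples`.

```r
# Hypothetical helper: returns a list of row-index vectors, one per subsample
covering_subsamples = function(num_rows, num_subsamples, subsample_size) {
  stopifnot(subsample_size <= num_rows,
            num_subsamples * subsample_size >= num_rows)
  full_index = seq_len(num_rows)

  # Redraw the guaranteed assignments until no subsample is over capacity
  repeat {
    guarantee = sample(seq_len(num_subsamples), size = num_rows, replace = TRUE)
    if (max(table(guarantee)) <= subsample_size) break
  }

  lapply(seq_len(num_subsamples), function(ss) {
    # Rows forced into this subsample
    forced = full_index[guarantee == ss]
    rest = setdiff(full_index, forced)
    # Index into rest via sample.int(), because sample(x, ...) treats a
    # single number x as 1:x
    extra = rest[sample.int(length(rest), subsample_size - length(forced))]
    c(forced, extra)
  })
}

set.seed(42)  # arbitrary seed for reproducibility
subs = covering_subsamples(num_rows = 1000, num_subsamples = 5,
                           subsample_size = 800)

# Coverage check: is every row in at least one subsample?
all(seq_len(1000) %in% unlist(subs))
```

Each element of `subs` can then be used to index the data, e.g. `dataset[subs[[1]], ]`, assuming `dataset` is your data frame.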