rapplyrollapply

rollapply based on values


I would like to resample a large data set with an unequal number of observations across the range in the data so that each range has an equal number of observations.

It seems like rollapply would be the way to do this, but it doesn't appear that it can be convinced to define its rolling window based on the data values?

For example:

set.seed(12345)    
z <- sort(rnorm(100,100,40))
rollapply(z, 20, function(x){sample(x,20,replace=TRUE)}, by=20) 

This does a great job of taking a list of numbers and resampling it every 20 numbers, however, I would like it to start at the lowest value and resample within a regular bin of values. For the above example the (left edge) bins could be defined like:

(0:10)*(max(z)-min(z))/10+min(z)

I know I could write a for loop and do this, but I am looking for a faster / simpler method.

An input vector with unequal distribution of observations between the ranges 1:10 and 11:20: c( 1, 2, 2, 3, 3, 3, 5, 6, 7, 11, 13, 13, 20) Resampled 5 times at 2 intervals of 10 units (i.e from 1:10 and 11:20)each interval sampled 5 times could produce:

c( 3, 1, 7, 3, 2, 11,20,11,13,20)


Solution

  • I guess the for loop is the simplest way to do it. The solution I ultimately developed is for a data frame, but is is essentially the same solution you would use for a simple vector (as in the wording of my original question).

    Fake data with an uneven distribution:

    test <- data.frame(Length=rlnorm(1000,2,1), Weight=rlnorm(1000,3,2))
    

    Define the resampling function:

    resamplr <- function(data) {
      bins <- (0:9) * (max(data$Length) - min(data$Length)) / 10 + min(data$Length)    #define a vector representing the left edge of bins.
      step <- (max(data$Length) - min(data$Length)) / 10 + .000001    #define the step and add a little so you don't land on any number exactly (i.e right edge)
      result <- NULL
      for (i in 1:length(bins)) {
        temp <- data[data$Length >= bins[i] & data$Length < (bins[i] + step), ]   #select data range
        result <- rbind(result, temp[sample(nrow(temp), 10, replace = T),])  #randomly sample it with replacement, and tack it onto the resampling from the previous range.
      }
      return(result)
    }
    

    Then to use the function:

    resamplr(test) 
    

    Refinements and suggestions improving the approach are of course appreciated...