python resampling subsampling

How to sample without replacement, treating three consecutive points as one unit per draw


The goal is to sample n data points from the original population. The population is serially correlated (think of it as time series data), so I want each draw to pick three neighboring data points as one unit. That is, every draw must select three consecutive data points, and the draws must be made without replacement.

The draw is repeated until the number of sampled data points reaches n. Every chosen data point must be unique. (Assume all data points in the population are unique.)

How can I implement this? Ideally the code should be fast.

import numpy as np

def subsampling(population, size, consecutive=3):
    # Pick seed points at random; nothing prevents two seeds from being neighbors.
    seed_samples = np.random.choice(population,
                                    size=size // consecutive,
                                    replace=False)
    target_samples = set(seed_samples)
    # Add the points that follow each seed (population is assumed to be sorted).
    for dpoint in seed_samples:
        start = np.searchsorted(population, dpoint, side='right')
        neighbors = population[start:(start + consecutive - 1)]
        # update() adds each neighbor; add() would fail on an unhashable array.
        target_samples.update(neighbors)

    return sorted(target_samples)

This is my rough attempt, but it does not return the correct number of points, because the groups around different seeds can overlap and produce duplicates.


Solution

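  • Suppose the population has 1000 entries and you want 200 non-overlapping triplets.

    One simple method: draw 200 unique random integers x[0], x[1], ..., x[199] from 0 to 599 inclusive (600 = 1000 - 200*2 possible values) and sort them in ascending order. Because the sorted values are distinct, adding 2*n to the n-th value puts consecutive starting indexes at least 3 apart, so the triplets cannot overlap, and the largest possible start is 599 + 398 = 997. The required indexes for the triplets are then: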

    0. x[0], x[0]+1, x[0]+2
    1. x[1]+2, x[1]+3, x[1]+4
    2. x[2]+4, x[2]+5, x[2]+6
    ...
    n. x[n]+2*n, x[n]+2*n+1, x[n]+2*n+2
    ...
    199. x[199]+398, x[199]+399, x[199]+400
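
The method above can be sketched in NumPy roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `sample_consecutive` and its `rng` parameter are made up for illustration, and it assumes the population is already in time order so that neighboring positions are neighboring data points.

```python
import numpy as np

def sample_consecutive(population, size, consecutive=3, rng=None):
    """Draw `size` points as non-overlapping runs of `consecutive` neighbors."""
    rng = np.random.default_rng() if rng is None else rng
    population = np.asarray(population)
    n_units = size // consecutive                     # number of runs to draw
    # Number of possible offsets, e.g. 1000 - 200 * 2 = 600.
    n_slots = len(population) - n_units * (consecutive - 1)
    if n_units > n_slots:
        raise ValueError("population too small for the requested size")
    # Unique offsets x[0] < x[1] < ... drawn from 0 .. n_slots - 1.
    x = np.sort(rng.choice(n_slots, size=n_units, replace=False))
    # Shift the k-th offset by (consecutive - 1) * k so the runs cannot overlap.
    starts = x + (consecutive - 1) * np.arange(n_units)
    # Expand every start into `consecutive` neighboring indexes.
    idx = (starts[:, None] + np.arange(consecutive)).ravel()
    return population[idx]

# Hypothetical usage: 200 triplets (600 points) from 1000 entries.
population = np.arange(1000)
sample = sample_consecutive(population, 600)
```

There is no Python-level loop, so the cost is dominated by the random draw and the sort. If `size` is not a multiple of `consecutive`, this sketch rounds down to whole runs.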