algorithmdata-structuresprobabilityprobability-theoryreservoir-sampling

Reservoir Sampling unable to understand probability


To make clear following is the question:

Given an input stream of an indeterminate length, how do you return a random member of that stream (with equal probability for each), given that you're not allowed to store more than a constant number of inputs, and you can only go through the inputs once

The solution to this problem seems to be Reservoir Sampling and it states below. "First, you want to make a reservoir (array) of 1,000 elements and fill it with the first 1,000 elements in your stream. That way if you have exactly 1,000 elements, the algorithm works. This is the base case.

Next, you want to process the i'th element (starting with i = 1,001) such that at the end of processing that step, the 1,000 elements in your reservoir are randomly sampled amongst the i elements you've seen so far. How can you do this? Start with i = 1,001. With what probability after the 1001'th step should element 1,001 (or any element for that matter) be in the set of 1,000 elements? The answer is easy: 1,000/1,001."

I am unable to understand the last sentence "The answer is easy: 1,000/1,001". Shouldn't be the probability of finding 1 element in array of 1001 elements be 1/1001 and not 1000/1001 ? Isn't the Sample space equal to 1001 and favorable number of outcomes equal to 1 ?


Solution

  • There are 1,001 elements. 1,000 of them are in the sample. One is outside of the sample. So the probability that a particular element is the one which is outside is 1 out of 1,001, and the probability that it is one of the thousand elements inside the sample is 1,000 out of 1,001.