Tags: python, random, sample

random.sample(population, X) is sometimes not contained in random.sample(population, Y) when Y > X


EDIT: I replaced the code in my original post with an MRE from user @no comment.

I noticed seemingly non-intuitive behaviour when using a seeded random.sample(population, k) call to sample from a list of files in a directory.

I would expect that sampling 100 items with k=100 while using a seed guarantees that, when I subsequently re-seed and sample with k=800, those 100 items are contained in the list of 800 sampled items.

Below is an example for a number of population sizes and seeds:

import random

for n in 1000, 2000, 5000:
    print(end=f'{n=}: ')
    for i in range(10):
        random.seed(i)                   # same seed for both draws
        x = random.sample(range(n), 100)
        random.seed(i)                   # re-seed so the second draw starts from the same state
        y = random.sample(range(n), 800)
        # count how many of the first 100 positions agree
        print(sum(a == b for a, b in zip(x, y[:100])), end=' ')
    print()

The output shows the expected perfect match when sampling from a population of 1000 or 5000, but, for no apparent reason, only a partial match when sampling from 2000:

n=1000: 100 100 100 100 100 100 100 100 100 100 
n=2000: 93 61 38 44 68 36 39 33 71 86 
n=5000: 100 100 100 100 100 100 100 100 100 100 

Similarly, after running my sampling script in some directories, I get mixed results: in some directories the 100 sampled files are all within the 800, but in others a few files differ, appearing in one sample but not the other and vice versa.

Are there any sources of randomness that I am not aware of?

I tried sorting the result of the os.listdir() call that lists the directory contents, but this didn't change anything. I know I can refactor the script to first sample the larger number and slice the resulting list (see the sketch below), but I would expect my original script to work the same way too.
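
For reference, a minimal sketch of that workaround, under the assumption that the population is a sorted list of file names (the names below are made up for illustration):

import random

# stand-in for the real directory listing, e.g. sorted(os.listdir(path))
files = [f"file_{i:04}.txt" for i in range(2000)]

random.seed(0)
big = random.sample(files, 800)    # draw the larger sample once
small = big[:100]                  # the smaller sample is a prefix of the larger one by construction

assert all(f in big for f in small)   # trivially true, whatever sample() does internally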


Solution

  • The exact algorithm isn't documented/guaranteed. You're likely using CPython, which chooses between two slightly different algorithms for efficiency, depending on whether you're asking for a small or a large percentage of the population. Both algorithms select one element at a time and append it to the sample, but they differ in how they pick the next element and how they keep track of the selected or remaining elements. Each of the two algorithms on its own actually has the property you desire. But what can happen is that you sample 100 with one algorithm and 800 with the other, and then they don't match.

    (If you noticed and are wondering why 100 out of 1000 counts as a "large percentage" while 800 out of 5000 counts as a "small percentage": it's not a fixed percentage; see the implementation for the formula, or the sketch below.)
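
To make that concrete, here is a rough sketch based on CPython's Lib/random.py (recent 3.x versions). The names uses_pool, sample_pool and sample_set are made up for illustration, and _randbelow is a private CPython helper, so none of this is a documented or guaranteed API. Within either strategy the first 100 draws don't depend on k, which is why the prefix property holds whenever the same strategy is picked for both calls; the mismatch for n=2000 appears exactly where the two calls land on different strategies.

import random
from math import ceil, log

def uses_pool(n, k):
    # Threshold from CPython's Lib/random.py: copying the population into a
    # pool is cheaper when n is small relative to the size of a k-element set.
    setsize = 21
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))
    return n <= setsize

def sample_pool(rng, n, k):
    # "Pool" strategy (partial Fisher-Yates): pick an index among the
    # not-yet-selected items, then move a survivor into the vacated slot.
    pool = list(range(n))
    result = []
    for i in range(k):
        j = rng._randbelow(n - i)        # _randbelow: CPython's internal helper
        result.append(pool[j])
        pool[j] = pool[n - i - 1]
    return result

def sample_set(rng, n, k):
    # "Set" strategy: draw indices over the full range, rejecting duplicates.
    selected = set()
    result = []
    while len(result) < k:
        j = rng._randbelow(n)
        while j in selected:
            j = rng._randbelow(n)
        selected.add(j)
        result.append(j)
    return result

for n in 1000, 2000, 5000:
    print(n, uses_pool(n, 100), uses_pool(n, 800))
# 1000 True True    -> same strategy for k=100 and k=800: the 100 are a prefix of the 800
# 2000 False True   -> k=100 uses the set strategy, k=800 the pool: prefixes disagree
# 5000 False False  -> same strategy again: prefixes match

# On current CPython the simplified versions reproduce random.sample for a
# range(n) population (again, an implementation detail, not a guarantee):
print(sample_set(random.Random(1), 2000, 100) == random.Random(1).sample(range(2000), 100))
print(sample_pool(random.Random(1), 2000, 800) == random.Random(1).sample(range(2000), 800))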