You have been given the input data (a sequence of items), for example:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
and your task is to randomly select M non-overlapping samples (subsequences) of the same size N.
For example, if the task was to select 3 samples of size 3, one of the solutions would be:
[3, 4, 5]
[8, 9, 10]
[11, 12, 13]
The samples are unordered so [8, 9, 10]
, [3, 4, 5]
, [11, 12, 13]
is the same solution. All solutions are expected to have an equal probability of being selected.
My algorithm:
[11, 12, 13]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
and [14, 15, 16]
.[3, 4, 5]
.[6, 7, 8, 9, 10]
and [14, 15, 16]
.[8, 9, 10]
.Sadly, this algorithm does not work when the samples are too big. If the task was to select 3 samples of size 5, there exists a solution, but if you use my algorithm and select randomly the first sample as [3, 4, 5, 6, 7]
, the algorithm will fail.
Of course there is also an obvious brute-force algorithm: find all possible solutions and select randomly one of them. But I was hoping for something more "clever" (time and space efficient).
import random
seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
M = 3
N = 3
dividers = sorted(random.sample(range(len(seq) - M*N + M), M))
starts = [d + i*(N-1) for i, d in enumerate(dividers)]
samples = [seq[start : start+N] for start in starts]
print(samples)
Example output (Attempt This Online!):
[[3, 4, 5], [10, 11, 12], [13, 14, 15]]
Idea is that if you want 3 subsequences of size 3 from overall 16 items, then 7 elements are not in the subsequences. They're instead in the 4 gaps before/after the subsequences. Each gap can be empty. So find 4 nonnegative integers with sum 7 and use them as gap sizes. How to do that was inspired by this answer.
Alternative explanation: If we had N=1, this would simply be sampling M single items. But in general, each of them is accompanied by the next N-1 items to make up a subsequence. So we basically do sample(range(len(seq)), M)
for the N=1 case, but for the general case shrink the range by M*(N-1)
to allow room for the M times N-1 accompaniers, which we then add.
I tried choosing samples 120000 times and counting how often each starts
tuple occurred. Each of the 120 possibilities occurred about 1000 times as expected:
120
((0, 3, 6), 1030)
((0, 3, 7), 1007)
((0, 3, 8), 956)
...
((6, 9, 13), 1010)
((6, 10, 13), 1006)
((7, 10, 13), 986)
Code:
import random
from collections import Counter
seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
M = 3
N = 3
ctr = Counter()
for _ in range(120_000):
dividers = sorted(random.sample(range(len(seq) - M*N + M), M))
starts = [d + i*(N-1) for i, d in enumerate(dividers)]
ctr[*starts] += 1
print(len(ctr))
for x in sorted(ctr.items()):
print(x)