pythonalgorithmsubsequence

Algorithm to select multiple non-overlapping subsequences of given sequence


You have been given the input data (a sequence of items), for example:

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

and your task is to randomly select M non-overlapping samples (subsequences) of the same size N.

For example, if the task was to select 3 samples of size 3, one of the solutions would be:

[3, 4, 5]
[8, 9, 10]
[11, 12, 13]

The samples are unordered so [8, 9, 10], [3, 4, 5], [11, 12, 13] is the same solution. All solutions are expected to have an equal probability of being selected.


My algorithm:

  1. Select randomly first sample: [11, 12, 13]
  2. Remaining input data are [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and [14, 15, 16].
  3. Select randomly second sample from the remaining input data: [3, 4, 5].
  4. Remaining input data (big enough) are [6, 7, 8, 9, 10] and [14, 15, 16].
  5. Select randomly third sample from the remaining input data: [8, 9, 10].

Sadly, this algorithm does not work when the samples are too big. If the task was to select 3 samples of size 5, there exists a solution, but if you use my algorithm and select randomly the first sample as [3, 4, 5, 6, 7], the algorithm will fail.


Of course there is also an obvious brute-force algorithm: find all possible solutions and select randomly one of them. But I was hoping for something more "clever" (time and space efficient).


Solution

  • import random
    
    seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
    M = 3
    N = 3
    
    dividers = sorted(random.sample(range(len(seq) - M*N + M), M))
    starts = [d + i*(N-1) for i, d in enumerate(dividers)]
    samples = [seq[start : start+N] for start in starts]
    print(samples)
    

    Example output (Attempt This Online!):

    [[3, 4, 5], [10, 11, 12], [13, 14, 15]]
    

    Idea is that if you want 3 subsequences of size 3 from overall 16 items, then 7 elements are not in the subsequences. They're instead in the 4 gaps before/after the subsequences. Each gap can be empty. So find 4 nonnegative integers with sum 7 and use them as gap sizes. How to do that was inspired by this answer.

    Alternative explanation: If we had N=1, this would simply be sampling M single items. But in general, each of them is accompanied by the next N-1 items to make up a subsequence. So we basically do sample(range(len(seq)), M) for the N=1 case, but for the general case shrink the range by M*(N-1) to allow room for the M times N-1 accompaniers, which we then add.

    I tried choosing samples 120000 times and counting how often each starts tuple occurred. Each of the 120 possibilities occurred about 1000 times as expected:

    120
    ((0, 3, 6), 1030)
    ((0, 3, 7), 1007)
    ((0, 3, 8), 956)
    ...
    ((6, 9, 13), 1010)
    ((6, 10, 13), 1006)
    ((7, 10, 13), 986)
    

    Code:

    import random
    from collections import Counter
    
    seq = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
    M = 3
    N = 3
    
    ctr = Counter()
    for _ in range(120_000):
        dividers = sorted(random.sample(range(len(seq) - M*N + M), M))
        starts = [d + i*(N-1) for i, d in enumerate(dividers)]
        ctr[*starts] += 1
    
    print(len(ctr))
    for x in sorted(ctr.items()):
        print(x)
    

    Attempt This Online!