Tags: python, python-3.x, list, algorithm, random

Fastest way to sample most numbers with minimum difference larger than a value from a Python list


Given a list of 20 float numbers, I want to find a largest subset in which any two elements differ from each other by more than mindiff = 1. Right now I am using a brute-force method, searching from the largest to the smallest subsets with itertools.combinations. As shown below, the code finds a subset after about 4 s for a list of 20 numbers.

from itertools import combinations
import random
from time import time

mindiff = 1.
length = 20
random.seed(99)
lst = [random.uniform(1., 10.) for _ in range(length)]

t0 = time()
n = len(lst)
sample = []
found = False
while not found:
    # get all subsets with size n
    subsets = list(combinations(lst, n))
    # shuffle to ensure randomness
    random.shuffle(subsets)
    for subset in subsets:
        # sort the subset numbers
        ss = sorted(subset)
        # calculate the differences between every two adjacent numbers
        diffs = [j-i for i, j in zip(ss[:-1], ss[1:])]
        if min(diffs) > mindiff:
            sample = set(subset)
            found = True
            break
    # move on to subsets of size n-1
    n -= 1

print(sample)
print(time()-t0)

Output:

{2.3704888087015568, 4.365818049020534, 5.403474619948962, 6.518944556233767, 7.8388969285727015, 9.117993839791751}
4.182451486587524

However, in reality I have a list of 200 numbers, which makes a brute-force enumeration infeasible. I want a fast algorithm to sample just one random largest subset with a minimum pairwise difference larger than 1. Note that I want each sample to be random and of maximum size. Any suggestions?
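
For a sense of scale, just the number of size-100 subsets of a 200-element list is already astronomical:

from math import comb

# Count only the subsets of size 100 out of 200 numbers, roughly 9e58.
print(comb(200, 100))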


Solution

  • My previous answer assumed you simply wanted a single optimal solution. This answer assumes you want one sampled uniformly at random from the set of all optimal solutions.

    1. Construct a directed acyclic graph G with one node per point and an edge a -> b whenever b - a > mindist. Also add two virtual nodes, s and t, with edges s -> x and x -> t for every point x.

    2. Calculate for each node in G how many paths of length k exist to t. You can do this efficiently in O(n^2 k) time using dynamic programming with a table P[x][k]: initialize P[x][0] = 0 for all x except P[t][0] = 1, then fill in P[x][k] = sum(P[y][k-1] for y in neighbors(x)).

      Keep doing this until you reach the maximum k with P[s][k] > 0 - since the path also contains the virtual nodes s and t, the optimal subset has k - 1 elements.

    3. Uniformly sample a path of length k from s to t using P to weight your choices.

      This is done by starting at s. We then look at each neighbor y of s and choose one randomly, weighted by P[y][k-1]. This gives us the first element of the optimal set.

      We then repeatedly perform this step: when we are at x, we look at the neighbors y of x and pick one randomly using the weights P[y][k-i], where i is the number of nodes chosen so far (including s).

    4. Use the nodes you sampled in 3 as your random subset.

    An implementation of the above in pure Python:

    import random
    
    def sample_mindist_subset(xs, mindist):
        # Construct directed graph G.
        n = len(xs)
        s = n; t = n + 1  # Two virtual nodes, source and sink.
        neighbors = {
            i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
            for i in range(n)}
        neighbors[s] = [t] + list(range(n))
        neighbors[t] = []
    
        # Compute number of paths P[x][k] from x to t of length k.
        P = [[0 for _ in range(n+2)] for _ in range(n+2)]
        P[t][0] = 1
        for k in range(1, n+2):
            for x in range(n+2):
                P[x][k] = sum(P[y][k-1] for y in neighbors[x])
    
        # Sample maximum length path uniformly at random.
        maxk = max(k for k in range(n+2) if P[s][k] > 0)
        path = [s]
        while path[-1] != t:
            candidates = neighbors[path[-1]]
            weights = [P[cn][maxk-len(path)] for cn in candidates]
            path.append(random.choices(candidates, weights)[0])
    
        return [xs[i] for i in path[1:-1]]
    

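    For example, assuming the function above is in scope, it can be applied to the question's 20-number list like this (the subset returned will differ from run to run):

    random.seed(99)
    lst = [random.uniform(1., 10.) for _ in range(20)]
    print(sorted(sample_mindist_subset(lst, 1.)))
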
    Note that if you want to sample from the same set of numbers many times, you don't have to recompute P every single time and can re-use it.
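
    One possible way to structure that re-use, as a sketch: split the work into a precompute step that builds the graph and the table P once, and a cheap sampling step that can then be called repeatedly (the helper names here are just illustrative):

    import random

    def precompute_paths(xs, mindist):
        # Build the DAG and the path-count table P once for a fixed list.
        n = len(xs)
        s, t = n, n + 1
        neighbors = {
            i: [t] + [j for j in range(n) if xs[j] - xs[i] > mindist]
            for i in range(n)}
        neighbors[s] = [t] + list(range(n))
        neighbors[t] = []
        P = [[0] * (n + 2) for _ in range(n + 2)]
        P[t][0] = 1
        for k in range(1, n + 2):
            for x in range(n + 2):
                P[x][k] = sum(P[y][k - 1] for y in neighbors[x])
        maxk = max(k for k in range(n + 2) if P[s][k] > 0)
        return neighbors, P, maxk, s, t

    def sample_precomputed(xs, neighbors, P, maxk, s, t):
        # Each call only walks one random maximum-length path, which is cheap.
        path = [s]
        while path[-1] != t:
            candidates = neighbors[path[-1]]
            weights = [P[c][maxk - len(path)] for c in candidates]
            path.append(random.choices(candidates, weights)[0])
        return [xs[i] for i in path[1:-1]]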