algorithm, complexity-theory, shuffle, fisher-yates-shuffle

How to write a prioritized left-shuffle algorithm in O(n)?


There are shuffle algorithms like Fisher-Yates. They take an array and return one with the elements in random order. This runs in O(n).

What I'm trying to do is to implement a prioritized left-shuffle algorithm. What does that mean?

Let's take this example: [ (1, 10), (2, 10), (3, 60), (4, 20) ]. Each pair is (value, weight), and at every position the next element should be picked from the remaining ones with probability proportional to its weight, so heavier elements tend to land further to the left. The most probable result should be [ 3, 4, 1, 2 ] or [ 3, 4, 2, 1 ].
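
To make the target distribution concrete, here is a small sketch (my own illustration, not part of the question; the helper ordering_probability is made up) that computes the probability of any given ordering under that model:

def ordering_probability(order, weights):
    # Multiply the chance of drawing each element from what is still left;
    # each draw is proportional to the element's weight.
    remaining = sum(weights.values())
    p = 1.0
    for item in order:
        p *= weights[item] / remaining
        remaining -= weights[item]
    return p

weights = {1: 10, 2: 10, 3: 60, 4: 20}
print(ordering_probability([3, 4, 1, 2], weights))  # ~0.15, one of the most likely orders
print(ordering_probability([1, 2, 3, 4], weights))  # ~0.0083, far less likely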


I tried implementing this, but I haven't found any solution in O(n).

O(n^2) in pseudocode based on Fisher-Yates:

sum = 100  # total weight of all elements (100%)
for i = 0 to n-2:
    r = random value in [0, sum)              # point in the remaining weight mass
    localsum = 0
    for j = i to n-1:                         # scan the not-yet-placed suffix
        localsum = localsum + pair[j].Probability
        if localsum > r:                      # element j covers the chosen point
            swap(i, j)
            break
    sum = sum - pair[i].Probability           # remove the placed element's weight
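
A runnable Python version of this O(n^2) approach might look like the following (a sketch of my own, not part of the question; the name prioritized_shuffle is made up):

import random

def prioritized_shuffle(pairs):
    # pairs is a list of (value, weight) tuples.
    pairs = pairs.copy()
    total = sum(weight for _, weight in pairs)
    for i in range(len(pairs) - 1):
        r = random.uniform(0, total)        # point in the remaining weight mass
        local_sum = 0
        for j in range(i, len(pairs)):      # scan the not-yet-placed suffix
            local_sum += pairs[j][1]
            if local_sum > r:               # element j covers the chosen point
                pairs[i], pairs[j] = pairs[j], pairs[i]
                break
        total -= pairs[i][1]                # remove the placed element's weight
    return [value for value, _ in pairs]

print(prioritized_shuffle([(1, 10), (2, 10), (3, 60), (4, 20)]))  # e.g. [3, 4, 1, 2]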

What could probably improve this a bit: sorting the elements by decreasing probability right at the beginning, to minimize the number of swaps and the iterations in the inner loop.

Is there a better solution (maybe even in O(n))?


Solution

  • Update of my first answer:

    I've found a paper that introduces 'Roulette-wheel selection via stochastic acceptance', which selects an item in O(1) expected time (assuming the weights are not extremely skewed). This brings the whole shuffle down to O(n) expected time and is simple to implement:

    from random import randint
    from random import random
    import time
    
    data = [ (1, 10), (2, 10), (3, 60), (4, 20) ]
    
    def swap(i, j, array):
        array[j], array[i] = array[i], array[j]
    
    def roulette_wheel_selection(data, start, max_weight_limit):
        # Stochastic acceptance: propose a uniformly random index from the
        # not-yet-placed suffix [start, n-1] and accept it with probability
        # weight / max_weight_limit, otherwise retry.
        while True:
            r = random()
            r_index = randint(start, len(data) - 1)
            if r <= data[r_index][1] / max_weight_limit:
                return r_index
        
    
    def shuffle(data, max_weight):
        # Prioritized left-shuffle: fill each position i with an element chosen
        # from the remaining suffix with probability proportional to its weight.
        data = data.copy()
        n = len(data)
        for i in range(n-1):
            r_index = roulette_wheel_selection(data, i, max_weight)
            swap(i, r_index, data)
        return data
    
    def performance_test(iterations, data):
        # Time `iterations` shuffles of `data`; max_weight is the largest weight.
        start = time.time()
        max_weight = max([item[1] for item in data])
        for i in range(iterations):
            shuffle(data, max_weight)
        end = time.time()
        print(len(data), ': ', end - start)
        return end - start
    
    performance_test(1000, data)
    
    data2 = []
    for i in range(10):
        data2 += data
    performance_test(1000, data2)  
    
    data3 = []
    for i in range(100):
        data3 += data
    performance_test(1000, data3) 
    
    data4 = []
    for i in range(1000):
        data4 += data
    performance_test(1000, data4) 
    

    Performance Output

    4 :  0.09153580665588379
    40 :  0.6010794639587402
    400 :  5.142168045043945
    4000 :  50.09365963935852
    

    So it's linear time in n (the data size). Compared to my first answer, I changed the constant used for acceptance from the "updated sum" to the "maximum weight of all data items". Of course the runtime still depends on that max_weight constant: the further the weights fall below the maximum, the more rejections the selection needs. If someone has a strategy to update max_weight properly as elements are placed, the performance would increase further.
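
    As a quick sanity check (my own addition, not part of the original answer), one can verify empirically that the heavier items tend to end up on the left, e.g. that item 3 comes first in roughly 60% of the runs:

    counts = {value: 0 for value, _ in data}
    runs = 10000
    max_weight = max(item[1] for item in data)
    for _ in range(runs):
        first_value = shuffle(data, max_weight)[0][0]
        counts[first_value] += 1
    print(counts)  # expect roughly {1: 1000, 2: 1000, 3: 6000, 4: 2000}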