There are shuffle algorithms like Fisher-Yates: they take an array and return it with its elements in uniformly random order, in O(n) time.
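For reference, a plain (unweighted) Fisher-Yates pass looks like this in Python; it is essentially what random.shuffle does:

from random import randint

def fisher_yates(array):
    # Walk the array from left to right and swap each position
    # with a uniformly chosen position from the remaining suffix.
    for i in range(len(array) - 1):
        j = randint(i, len(array) - 1)
        array[i], array[j] = array[j], array[i]
    return array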
What I'm trying to do is to implement a prioritized left-shuffle algorithm. What does that mean?
Let's take this example: [ (1, 10), (2, 10), (3, 60), (4, 20) ]. Value 1 has 10%, value 2 has 10%, value 3 has 60%, and value 4 has 20%. The most probable result should be [ 3, 4, 1, 2 ] or [ 3, 4, 2, 1 ].
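In other words, each weight is that element's chance of being picked for the next free position on the left, among the elements still remaining. A quick sanity check of the intended distribution for the first position (just a sketch; it uses random.choices for a single weighted draw):

from random import choices
from collections import Counter

data = [ (1, 10), (2, 10), (3, 60), (4, 20) ]
values = [v for v, _ in data]
weights = [w for _, w in data]

# The first element of a prioritized shuffle is one weighted draw,
# so value 3 should land in front roughly 60% of the time.
firsts = Counter(choices(values, weights=weights)[0] for _ in range(100_000))
print(firsts)  # roughly {3: ~60000, 4: ~20000, 1: ~10000, 2: ~10000}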
I tried implementing this, but I haven't found any solution in O(n).
My O(n^2) attempt in pseudocode, based on Fisher-Yates:

sum = 100  # total remaining probability (100%)
for i = 0 to n-2:
    r = random integer between 1 and sum
    localsum = 0
    for j = i to n-1:
        localsum = localsum + pair[j].Probability
        if localsum >= r:
            swap(i, j)
            break
    sum = sum - pair[i].Probability
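For completeness, here is the same O(n^2) idea as runnable Python (my own direct translation of the pseudocode; weighted_fy_shuffle is just an illustrative name, and I draw a continuous random value instead of an integer percentage):

from random import uniform

def weighted_fy_shuffle(pairs):
    # pairs: list of (value, weight) tuples with positive weights.
    pairs = pairs.copy()
    total = sum(w for _, w in pairs)
    for i in range(len(pairs) - 1):
        r = uniform(0, total)          # random point in the remaining weight mass
        local_sum = 0
        for j in range(i, len(pairs)):
            local_sum += pairs[j][1]
            if local_sum >= r:         # element j owns the interval containing r
                pairs[i], pairs[j] = pairs[j], pairs[i]
                break
        total -= pairs[i][1]           # the placed weight leaves the pool
    return pairs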
One thing that would probably improve this a bit: sorting the elements by decreasing probability at the start, to minimize the number of swaps and the iterations of the inner loop.
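That preprocessing step is just a descending sort on the weights, e.g.:

pairs.sort(key=lambda p: p[1], reverse=True)  # heaviest weights first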
Is there a better solution (maybe even in O(n))?
Update of my first answer:
I've found a paper that introduces 'Roulette-wheel selection via stochastic acceptance', which selects a weighted item in O(1) expected time. That makes the whole shuffle O(n) and it is simple to implement:
from random import randint
from random import random
import time

data = [ (1, 10), (2, 10), (3, 60), (4, 20) ]

def swap(i, j, array):
    array[j], array[i] = array[i], array[j]

def roulette_wheel_selection(data, start, max_weight_limit):
    # Stochastic acceptance: pick a random index in [start, n-1] and accept it
    # with probability weight / max_weight_limit; otherwise retry.
    while True:
        r = random()
        r_index = randint(start, len(data) - 1)
        if r <= data[r_index][1] / max_weight_limit:
            return r_index

def shuffle(data, max_weight):
    data = data.copy()
    n = len(data)
    for i in range(n - 1):
        # Fill position i with a weighted pick from the remaining suffix.
        r_index = roulette_wheel_selection(data, i, max_weight)
        swap(i, r_index, data)
    return data
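# Quick usage check: the exact order varies per run, but value 3 should
# usually come out first.
print(shuffle(data, max([item[1] for item in data])))  # e.g. [(3, 60), (4, 20), (1, 10), (2, 10)]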
def performance_test(iterations, data):
    start = time.time()
    max_weight = max([item[1] for item in data])
    for i in range(iterations):
        shuffle(data, max_weight)
    end = time.time()
    print(len(data), ': ', end - start)
    return end - start

performance_test(1000, data)

data2 = []
for i in range(10):
    data2 += data
performance_test(1000, data2)

data3 = []
for i in range(100):
    data3 += data
performance_test(1000, data3)

data4 = []
for i in range(1000):
    data4 += data
performance_test(1000, data4)
Performance output (data size : seconds for 1000 shuffles):
4 : 0.09153580665588379
40 : 0.6010794639587402
400 : 5.142168045043945
4000 : 50.09365963935852
So the running time is linear in n (the data size). Compared to my first answer, I changed the constant used in the acceptance test from the 'updated sum' to the 'maximum weight of all data items'. Of course, it still depends on the max_weight constant: if someone has a strategy to update max_weight properly as the shuffle progresses, the performance would improve further.
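One possible way to do that (just a sketch, not from the paper; shuffle_with_updating_max is an illustrative name, and it reuses swap and roulette_wheel_selection from above): keep the remaining weights in a max-heap with lazy deletion, so max_weight always refers only to the items that have not been placed yet. The heap maintenance adds an O(n log n) term, but the acceptance test in the rejection loop becomes tighter once the heavy items are placed, which should help when the weights are very skewed.

import heapq
from collections import Counter

def shuffle_with_updating_max(data):
    # Sketch: same stochastic-acceptance shuffle, but max_weight shrinks
    # as heavy items are moved into the finished prefix.
    data = data.copy()
    heap = [-w for _, w in data]        # max-heap of weights via negation
    heapq.heapify(heap)
    placed = Counter()                  # weights already placed (lazy deletion)
    n = len(data)
    for i in range(n - 1):
        # Discard stale heap entries until the top weight is still in play.
        while placed[-heap[0]]:
            placed[-heap[0]] -= 1
            heapq.heappop(heap)
        max_weight = -heap[0]           # max weight among the remaining items
        r_index = roulette_wheel_selection(data, i, max_weight)
        placed[data[r_index][1]] += 1   # this weight leaves the candidate pool
        swap(i, r_index, data)
    return data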