There are shuffle algorithms like Fisher-Yates: they take an array and return it with its elements in uniformly random order, in O(n) time.
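For reference, a plain (unweighted) Fisher-Yates pass looks like this in Python; it is essentially what random.shuffle does:

from random import randint

def fisher_yates(array):
    # Walk the array from left to right and swap each position
    # with a uniformly chosen position from the remaining suffix.
    for i in range(len(array) - 1):
        j = randint(i, len(array) - 1)
        array[i], array[j] = array[j], array[i]
    return array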
What I'm trying to do is to implement a prioritized left-shuffle algorithm. What does that mean?
Let's take this example: [ (1, 10), (2, 10), (3, 60), (4, 20) ]. Value 1 has 10%, value 2 has 10%, value 3 has 60%, and value 4 has 20%. The most probable result should be [ 3, 4, 1, 2 ] or [ 3, 4, 2, 1 ].
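In other words, each weight is that element's chance of being picked for the next free position on the left, among the elements still remaining. A quick sanity check of the intended distribution for the first position (just a sketch; it uses random.choices for a single weighted draw):

from random import choices
from collections import Counter

data = [ (1, 10), (2, 10), (3, 60), (4, 20) ]
values = [v for v, _ in data]
weights = [w for _, w in data]

# The first element of a prioritized shuffle is one weighted draw,
# so value 3 should land in front roughly 60% of the time.
firsts = Counter(choices(values, weights=weights)[0] for _ in range(100_000))
print(firsts)  # roughly {3: ~60000, 4: ~20000, 1: ~10000, 2: ~10000}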
I tried implementing this, but I haven't found any solution in O(n).
My O(n^2) attempt in pseudocode, based on Fisher-Yates:

sum = 100  # total remaining probability (100%)
for i = 0 to n-2:
    r = random integer between 1 and sum
    localsum = 0
    for j = i to n-1:
        localsum = localsum + pair[j].Probability
        if localsum >= r:
            swap(i, j)
            break
    sum = sum - pair[i].Probability
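For completeness, here is the same O(n^2) idea as runnable Python (my own direct translation of the pseudocode; weighted_fy_shuffle is just an illustrative name, and I draw a continuous random value instead of an integer percentage):

from random import uniform

def weighted_fy_shuffle(pairs):
    # pairs: list of (value, weight) tuples with positive weights.
    pairs = pairs.copy()
    total = sum(w for _, w in pairs)
    for i in range(len(pairs) - 1):
        r = uniform(0, total)          # random point in the remaining weight mass
        local_sum = 0
        for j in range(i, len(pairs)):
            local_sum += pairs[j][1]
            if local_sum >= r:         # element j owns the interval containing r
                pairs[i], pairs[j] = pairs[j], pairs[i]
                break
        total -= pairs[i][1]           # the placed weight leaves the pool
    return pairs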
One thing that would probably improve this a bit: sorting the elements by decreasing probability at the start, to minimize the number of swaps and the iterations of the inner loop.
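That preprocessing step is just a descending sort on the weights, e.g.:

pairs.sort(key=lambda p: p[1], reverse=True)  # heaviest weights first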
Is there a better solution (maybe even in O(n))?
Update of my first answer:
I've found a paper that introduces 'Roulette-wheel selection via stochastic acceptance', which selects a weighted item in O(1) expected time. That makes the whole shuffle O(n) and it is simple to implement:
from random import randint
from random import random
import time

data = [ (1, 10), (2, 10), (3, 60), (4, 20) ]

def swap(i, j, array):
    array[j], array[i] = array[i], array[j]

def roulette_wheel_selection(data, start, max_weight_limit):
    # Stochastic acceptance: pick a random index in [start, n-1] and accept it
    # with probability weight / max_weight_limit; otherwise retry.
    while True:
        r = random()
        r_index = randint(start, len(data) - 1)
        if r <= data[r_index][1] / max_weight_limit:
            return r_index

def shuffle(data, max_weight):
    data = data.copy()
    n = len(data)
    for i in range(n - 1):
        # Fill position i with a weighted pick from the remaining suffix.
        r_index = roulette_wheel_selection(data, i, max_weight)
        swap(i, r_index, data)
    return data
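# Quick usage check: the exact order varies per run, but value 3 should
# usually come out first.
print(shuffle(data, max([item[1] for item in data])))  # e.g. [(3, 60), (4, 20), (1, 10), (2, 10)]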
def performance_test(iterations, data):
    start = time.time()
    max_weight = max([item[1] for item in data])
    for i in range(iterations):
        shuffle(data, max_weight)
    end = time.time()
    print(len(data), ': ', end - start)
    return end - start

performance_test(1000, data)

data2 = []
for i in range(10):
    data2 += data
performance_test(1000, data2)

data3 = []
for i in range(100):
    data3 += data
performance_test(1000, data3)

data4 = []
for i in range(1000):
    data4 += data
performance_test(1000, data4)
Performance output (data size : seconds for 1000 shuffles):
4 : 0.09153580665588379
40 : 0.6010794639587402
400 : 5.142168045043945
4000 : 50.09365963935852
So the running time is linear in n (the data size). Compared to my first answer, I changed the constant used in the acceptance test from the 'updated sum' to the 'maximum weight of all data items'. Of course, it still depends on the max_weight constant: if someone has a strategy to update max_weight properly as the shuffle progresses, the performance would improve further.
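One possible way to do that (just a sketch, not from the paper; shuffle_with_updating_max is an illustrative name, and it reuses swap and roulette_wheel_selection from above): keep the remaining weights in a max-heap with lazy deletion, so max_weight always refers only to the items that have not been placed yet. The heap maintenance adds an O(n log n) term, but the acceptance test in the rejection loop becomes tighter once the heavy items are placed, which should help when the weights are very skewed.

import heapq
from collections import Counter

def shuffle_with_updating_max(data):
    # Sketch: same stochastic-acceptance shuffle, but max_weight shrinks
    # as heavy items are moved into the finished prefix.
    data = data.copy()
    heap = [-w for _, w in data]        # max-heap of weights via negation
    heapq.heapify(heap)
    placed = Counter()                  # weights already placed (lazy deletion)
    n = len(data)
    for i in range(n - 1):
        # Discard stale heap entries until the top weight is still in play.
        while placed[-heap[0]]:
            placed[-heap[0]] -= 1
            heapq.heappop(heap)
        max_weight = -heap[0]           # max weight among the remaining items
        r_index = roulette_wheel_selection(data, i, max_weight)
        placed[data[r_index][1]] += 1   # this weight leaves the candidate pool
        swap(i, r_index, data)
    return data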