algorithmrandomshuffleweighted

What is an algorithm for weighted shuffling that scales well?


Setup

Items in a set have a weight property that can be set by an admin user. Items with more weight are more likely to achieve top position in a randomized list served to public visitors. Every item has a chance (however miniscule) to appear in any position.

What algorithm would ensure that changing an item’s weight would achieve a consistent result regardless of set size?

I know I’m out of my depth, but I don’t want to assume that the actual probability structure implied by the basic algorithm (Attempt 1) is valid. Am I tilting at windmills here?* Is my method flawed? Are my conclusions solid?

Parameters for success

  1. A heavier weight makes a higher position more likely.
  2. Each item must have at least a small chance of appearing in any position on the list. (This rules out a “rank class” or “rank quota” type of approach, which could end up mandating that items of a low rank never appear, or a high rank containing a single item always appear.)
  3. Set size should not affect one weight’s effect relative to another weight. Set size will of course affect a single item’s overall chances. I mean that in the set [a:2, b:2, c:2, d:2, e:3, f:3], the difference in chance between b and e will be the same as in the set [a:2, b:2, c:3, d:3, e:3, f:3] (same length, different sum) and the same as in the set [a:1, b:2, e:3, f:6] (different length, same sum).
  4. Only the difference in weight matters, not the actual amount of weight. In the set [3,3,17,17], bumping up the last item to 19 should have as much effect as bumping the first item up to 5; and the set [1,3] should behave similarly to [3,5]. This is so that users don’t have to understand the whole set to make an efficient change.

Attempts

1: Linear weights (standard method)

function linear_ratio($weight_set, $weight) {
    $sum = 0;
    foreach ($weight_set as $item) $sum += $item;
    return $weight / $sum;
}
  1. [PASS] A heavier weight makes a higher position more likely.
  2. [PASS] Heaviest weight still has a chance of bottom position.
  3. [PASS] Weights 3 and 3+2 have the same relative probabilities regardless of how large the set is.
  4. [FAIL] 3→5 is a 167% change, but 17→19 is a 112% change.

2: Exponential weights

function exponential_ratio($weight_set, $weight) {
    $sum = 0;
    foreach ($weight_set as $item) $sum += (2 ** $item);
    $weight = 2 ** $weight;
    return $weight / $sum;
}

This solves the 4th parameter, but the base (2) is arbitrary yet exponentially impactful.

  1. [PASS] A heavier weight makes a higher position more likely.
  2. [PASS] Heaviest weight still has a chance of bottom position.
  3. [PASS] Weights 3 and 3+2 have the same relative probabilities regardless of how large the set is.
  4. [PASS] 3→5 is a 400% change, and 17→19 is still a 400% change.

3: Set size as a parameter

function based_exponential_ratio($weight_set, $weight) {
    $set_size = count($weight_set);
    $sum = 0;
    foreach ($weight_set as $item) $sum += $set_size ** $item;
    $weight = $set_size ** $item;
    return $weight / $sum;
}

This just switches out one problem (the arbitrary base) for another. Given a set of 100 items, adding 1 more shouldn’t make much difference, but set size + 1 increases weight’s effect exponentially (thus reducing randomization).

4: Magnitude of set size as a parameter

I do care about adding 500 more items to the set of 100. So do I care when set size changes by an order of magnitude? “Take the natural log of the set size and raise it to the weight power”? My mental model can’t progress any further than this point, but I would have thought that taking an exponent of a log would be counter-productive.


Solution

  • The exponential approach you have is the most reasonable given your conditions.

    The way you are stating your final condition is not realistic. Generally, examining the percentage change in probability is not a useful way of looking at it.

    [FAIL] If the set is [2,2,13,13], incrementing the lightest value by 1 changes it from 4/16392 to 8/16400—a 199.9% change. Incrementing the heaviest value by 1 changes it from 8192/16392 to 16384/24584—only a 133.4% change.

    What if you have two products, and your initial weights are set to produce starting probabilities of [99/100, 1/100]? Then suppose there is a real number dw such that incrementing the weight of the second product by dw takes the probabilities to [95/100, 5/100]. That's a "400% increase" in the probability of the second product. There's no way to increase the probability of the first product in a remotely similar way.