Performance of reservoir sampling vs. getting the length of a list and picking random elements

I have written two functions to pick a random element out of a list of unknown length. The first uses reservoir sampling (with a reservoir of size 1), and the second gets the length of the list to pick a random index and return it. For some reason, the former is much faster.

The first function uses a single traversal and pick each element with probability (1/i), where i is the index of the element in the list. It results in a equal probability of picking each element.

pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
  stdgen <- newStdGen
  return (pickRandom' xs x 1 stdgen)

-- Pick a random number using reservoir sampling
pickRandom' :: (RandomGen g) => [a] -> a -> Int -> g -> a
pickRandom' [] xi _ _ = xi
pickRandom' (x:xs) xi n gen =
  let (rand, gen') = randomR (0, n) gen in
  if (rand == 0) then
    pickRandom' xs x (n + 1) gen' -- Update value
  else
    pickRandom' xs xi (n + 1) gen' -- Keep previous value

The second version traverses the list once to get its length, and then picks an index between 0 and the length of the input list (-1) to get one of the element, again with equal probability. The expected number of traversal of the list 1.5:

-- Traverses the list twice
pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
  gen <- newStdGen
  (e, _) <- return $ randomR (0, (length xs) - 1) gen
  return $ xs !! e

Here is the code I use for benchmarking these two functions:

main :: IO ()
main = do
  gen <- newStdGen
  let size = 2097152
      inputList = getRandList gen size
  defaultMain [ bench "Using length" (pickRandomWithLen inputList)
              , bench "Using reservoir" (pickRandom inputList)
              ]

Here is a stripped output:

benchmarking Using reservoir
mean: 82.72108 ns, lb 82.02459 ns, ub 83.61931 ns, ci 0.950

benchmarking Using length
mean: 17.12571 ms, lb 16.97026 ms, ub 17.37352 ms, ci 0.950

In other terms, the first function is about 200 times faster than the second. I expected the runtime to be influenced mainly by random number generation and the number of list traversals (1 vs. 1.5). What other factors can explain such a huge difference?

Solution

Your benchmarked actions don't actually evaluate the result,

pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
  stdgen <- newStdGen
  return (pickRandom' xs x 1 stdgen)

only gets a new StdGen and returns a thunk. That's pretty immediate.

pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
  gen <- newStdGen
  (e, _) <- return $ randomR (0, (length xs) - 1) gen
  return $ xs !! e

computes the length of the list and then returns a thunk, that is of course much slower.

Forcing both to evaluate the result,

return $! ...

makes the length using version much faster,

benchmarking Using length
mean: 14.65655 ms, lb 14.14580 ms, ub 15.16942 ms, ci 0.950
std dev: 2.631668 ms, lb 2.378186 ms, ub 2.937339 ms, ci 0.950
variance introduced by outliers: 92.581%
variance is severely inflated by outliers

benchmarking Using reservoir
collecting 100 samples, 1 iterations each, in estimated 47.00930 s
mean: 451.5571 ms, lb 448.4355 ms, ub 455.7812 ms, ci 0.950
std dev: 18.50427 ms, lb 14.45557 ms, ub 24.74350 ms, ci 0.950
found 4 outliers among 100 samples (4.0%)
  2 (2.0%) high mild
  2 (2.0%) high severe
variance introduced by outliers: 38.511%
variance is moderately inflated by outliers

(after forcing the input list to be evaluated before by printing its sum), because that needs only one call to the PRNG, while the reservoir sampling uses length list - 1 calls.

The difference would probably be smaller if a faster PRNG than a StdGen is used.

Indeed, using System.Random.Mersenne instead of StdGen (requires that pickRandom' has IO a result type, and since it offers no generation in a specific range but only default range skews the distribution of picked elements a little, but since we're only interested in the time needed for the pseudo-random number generation, that's not important), the time for the reservoir sampling drops to

mean: 51.83185 ms, lb 51.77620 ms, ub 51.91259 ms, ci 0.950
std dev: 482.4712 us, lb 368.4433 us, ub 649.1758 us, ci 0.950

(the pickRandomWithLen time doesn't change measurably, of course, since it uses only one generation). A roughly nine-fold speedup, that shows that the pseudo-random generation is the dominant factor.