I have written two functions to pick a random element out of a list of unknown length. The first uses reservoir sampling (with a reservoir of size 1), and the second gets the length of the list to pick a random index and return it. For some reason, the former is much faster.
The first function uses a single traversal and pick each element with probability (1/i), where i is the index of the element in the list. It results in a equal probability of picking each element.
pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
stdgen <- newStdGen
return (pickRandom' xs x 1 stdgen)
-- Pick a random number using reservoir sampling
pickRandom' :: (RandomGen g) => [a] -> a -> Int -> g -> a
pickRandom' [] xi _ _ = xi
pickRandom' (x:xs) xi n gen =
let (rand, gen') = randomR (0, n) gen in
if (rand == 0) then
pickRandom' xs x (n + 1) gen' -- Update value
else
pickRandom' xs xi (n + 1) gen' -- Keep previous value
The second version traverses the list once to get its length, and then picks an index between 0 and the length of the input list (-1) to get one of the element, again with equal probability. The expected number of traversal of the list 1.5:
-- Traverses the list twice
pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
gen <- newStdGen
(e, _) <- return $ randomR (0, (length xs) - 1) gen
return $ xs !! e
Here is the code I use for benchmarking these two functions:
main :: IO ()
main = do
gen <- newStdGen
let size = 2097152
inputList = getRandList gen size
defaultMain [ bench "Using length" (pickRandomWithLen inputList)
, bench "Using reservoir" (pickRandom inputList)
]
Here is a stripped output:
benchmarking Using reservoir
mean: 82.72108 ns, lb 82.02459 ns, ub 83.61931 ns, ci 0.950
benchmarking Using length
mean: 17.12571 ms, lb 16.97026 ms, ub 17.37352 ms, ci 0.950
In other terms, the first function is about 200 times faster than the second. I expected the runtime to be influenced mainly by random number generation and the number of list traversals (1 vs. 1.5). What other factors can explain such a huge difference?
Your benchmarked actions don't actually evaluate the result,
pickRandom :: [a] -> IO a
pickRandom [] = error "List is empty"
pickRandom (x:xs) = do
stdgen <- newStdGen
return (pickRandom' xs x 1 stdgen)
only gets a new StdGen
and returns a thunk. That's pretty immediate.
pickRandomWithLen :: [a] -> IO a
pickRandomWithLen [] = error "List is empty"
pickRandomWithLen xs = do
gen <- newStdGen
(e, _) <- return $ randomR (0, (length xs) - 1) gen
return $ xs !! e
computes the length of the list and then returns a thunk, that is of course much slower.
Forcing both to evaluate the result,
return $! ...
makes the length
using version much faster,
benchmarking Using length
mean: 14.65655 ms, lb 14.14580 ms, ub 15.16942 ms, ci 0.950
std dev: 2.631668 ms, lb 2.378186 ms, ub 2.937339 ms, ci 0.950
variance introduced by outliers: 92.581%
variance is severely inflated by outliers
benchmarking Using reservoir
collecting 100 samples, 1 iterations each, in estimated 47.00930 s
mean: 451.5571 ms, lb 448.4355 ms, ub 455.7812 ms, ci 0.950
std dev: 18.50427 ms, lb 14.45557 ms, ub 24.74350 ms, ci 0.950
found 4 outliers among 100 samples (4.0%)
2 (2.0%) high mild
2 (2.0%) high severe
variance introduced by outliers: 38.511%
variance is moderately inflated by outliers
(after forcing the input list to be evaluated before by printing its sum), because that needs only one call to the PRNG, while the reservoir sampling uses length list - 1
calls.
The difference would probably be smaller if a faster PRNG than a StdGen
is used.
Indeed, using System.Random.Mersenne
instead of StdGen
(requires that pickRandom'
has IO a
result type, and since it offers no generation in a specific range but only default range skews the distribution of picked elements a little, but since we're only interested in the time needed for the pseudo-random number generation, that's not important), the time for the reservoir sampling drops to
mean: 51.83185 ms, lb 51.77620 ms, ub 51.91259 ms, ci 0.950
std dev: 482.4712 us, lb 368.4433 us, ub 649.1758 us, ci 0.950
(the pickRandomWithLen
time doesn't change measurably, of course, since it uses only one generation). A roughly nine-fold speedup, that shows that the pseudo-random generation is the dominant factor.