Tags: algorithm, haskell, random, sampling, reservoir-sampling

Infinite/Lazy Reservoir Sampling in Haskell


I tried to implement simple reservoir sampling in Haskell, following http://jeremykun.com/2013/07/05/reservoir-sampling/ (note that the algorithm shown there is possibly semantically incorrect).

According to this question, Iterative or Lazy Reservoir Sampling, lazy reservoir sampling is impossible unless you know the population size ahead of time.

Even so, I'm not understanding why (operationally speaking) the below sampleReservoir doesn't work on infinite lists. Just where exactly is laziness broken?

import System.Random (randomRIO)

-- equivalent to python's enumerate
enumerate :: (Num i, Enum i) => i -> [e] -> [(i, e)]
enumerate start = zip [start..]

sampleReservoir :: [e] -> IO [e]
sampleReservoir stream = 
    foldr 
        (\(i, e) reservoir -> do 
            r <- randomRIO (0.0, 1.0) :: IO Double
                              -- the annotation pins the type; without it the
                              -- literals 0.0 and 1.0 are ambiguous
            if r < (1.0 / fromIntegral i) then
                fmap (e:) reservoir
            else 
                reservoir) 
        (return []) 
        (enumerate 1 stream)
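For what it's worth, on a finite list the function does terminate. A quick self-contained check (duplicating enumerate and sampleReservoir from above) that the kept elements are an in-order subset of the input:

```haskell
import System.Random (randomRIO)

enumerate :: (Num i, Enum i) => i -> [e] -> [(i, e)]
enumerate start = zip [start..]

sampleReservoir :: [e] -> IO [e]
sampleReservoir stream =
    foldr
        (\(i, e) reservoir -> do
            r <- randomRIO (0.0, 1.0) :: IO Double
            if r < (1.0 / fromIntegral i)
                then fmap (e :) reservoir
                else reservoir)
        (return [])
        (enumerate 1 stream)

main :: IO ()
main = do
    xs <- sampleReservoir [1 .. 10 :: Int]
    -- foldr preserves input order, so the sample must be an
    -- increasing subsequence of [1..10]
    putStrLn (if xs == filter (`elem` xs) [1 .. 10] then "OK" else "FAIL")
```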

The challenge and test is fmap (take 1) $ sampleReservoir [1..].

Furthermore, if reservoir sampling can't be lazy, what can take in a lazy list and produce a sampled lazy list?

I get the idea that there must be a way of making the above function lazy in the output as well, because I could change this:

if r < (1.0 / fromIntegral i) then
    fmap (e:) reservoir
else 
    reservoir

To:

if r < (1.0 / fromIntegral i) then
    do 
        print e
        fmap (e:) reservoir
else 
    reservoir
This shows results as the function iterates over the list. Using a coroutine abstraction, perhaps instead of print e there could be a yield e, with the rest of the computation held as a continuation.


Solution

  • The problem is that the IO monad maintains a strict sequence between actions. Writing fmap (e:) reservoir will first execute all of the effects associated with reservoir, which will be infinite if the input list is infinite.
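    The sequencing is easy to see in a minimal example: fmap runs the wrapped action to completion before the function is ever applied, so if that action were infinite, the cons would never happen.

```haskell
main :: IO ()
main = do
    -- the tail's effects run to completion before (0 :) is applied
    xs <- fmap (0 :) (putStrLn "tail's effects run first" >> return [1, 2, 3])
    print xs  -- prints [0,1,2,3], only after the message above
```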

    I was able to fix this with liberal use of unsafeInterleaveIO, which allows you to break the semantics of IO:

    import System.IO.Unsafe (unsafeInterleaveIO)

    sampleReservoir2 :: [e] -> IO [e]
    sampleReservoir2 stream = 
        foldr 
            (\(i, e) reservoir -> do 
                r <- unsafeInterleaveIO $ randomRIO (0.0, 1.0) :: IO Double -- annotation pins the ambiguous Double type
                if r < (1.0 / fromIntegral i) then unsafeInterleaveIO $ do
                    rr <- reservoir
                    return (e:rr)
                else 
                    reservoir) 
            (return []) 
            (enumerate 1 stream)
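    A minimal illustration (names are mine) of what unsafeInterleaveIO buys you: an IO action describing an infinite list that you can nevertheless bind and consume lazily.

```haskell
import System.IO.Unsafe (unsafeInterleaveIO)

-- an infinite list of naturals, produced inside IO but only on demand
lazyNats :: Int -> IO [Int]
lazyNats n = unsafeInterleaveIO $ do
    rest <- lazyNats (n + 1)
    return (n : rest)

main :: IO ()
main = do
    xs <- lazyNats 0   -- returns immediately; nothing is forced yet
    print (take 5 xs)  -- forces only five cells; prints [0,1,2,3,4]
```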
    

    Obviously, this will allow the interleaving of IO actions, but since all you're doing is generating random numbers it shouldn't matter. However, this solution isn't very satisfactory; the correct solution is to refactor your code somewhat. You should generate an infinite list of random numbers, then consume that infinite list (lazily) with foldr:

    import Control.Monad.Random (MonadRandom, getRandomRs)

    sampleReservoir3 :: MonadRandom m => [a] -> m [a]
    sampleReservoir3 stream = do
      ws <- getRandomRs (0, 1 :: Double) 
      return $ foldr 
         (\(w, (i, e)) reservoir -> 
            (if w < (1 / fromIntegral i) then (e:) else id) reservoir
         ) 
         []
         (zip ws $ enumerate 1 stream)
    

    This can also (equivalently) be written as

    import System.Random (newStdGen, randomRs)

    sampleReservoir4 :: [a] -> IO [a] 
    sampleReservoir4 stream = do
      seed <- newStdGen 
      let ws = randomRs (0, 1 :: Double) seed 
      return $ foldr 
         (\(w, (i, e)) reservoir -> 
            (if w < (1 / fromIntegral i) then (e:) else id) reservoir
         ) 
         []
         (zip ws $ enumerate 1 stream)
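    And this version does pass the original challenge: since the fold is now pure, take can pull a prefix out of a sample of an infinite list (termination is "with probability 1", since elements are kept with probability 1/i). A self-contained check, duplicating enumerate and sampleReservoir4:

```haskell
import System.Random (newStdGen, randomRs)

enumerate :: (Num i, Enum i) => i -> [e] -> [(i, e)]
enumerate start = zip [start..]

sampleReservoir4 :: [a] -> IO [a]
sampleReservoir4 stream = do
    seed <- newStdGen
    let ws = randomRs (0, 1 :: Double) seed
    return $ foldr
        (\(w, (i, e)) reservoir ->
            (if w < (1 / fromIntegral i) then (e :) else id) reservoir)
        []
        (zip ws $ enumerate 1 stream)

main :: IO ()
main = do
    xs <- fmap (take 5) (sampleReservoir4 [1 ..])  -- terminates: the fold is lazy
    -- the sample is drawn in input order, so the prefix is strictly increasing
    putStrLn (if and (zipWith (<) xs (tail xs)) then "OK" else "FAIL")
```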
    

    As an aside, I'm not sure about the correctness of the algorithm itself: at i = 1 the test w < 1.0 (virtually) always succeeds, so the first element of the input list is always kept and always comes first. Not very random.
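    For reference, the standard fixed-size variant (Vitter's Algorithm R) avoids that bias: every element of a finite stream ends up in the k-slot reservoir with probability k/n. A sketch, with names of my own choosing (this version is strict, so it requires a finite stream, which is exactly the trade-off discussed above):

```haskell
import System.Random (randomRIO)

-- Algorithm R: keep the first k elements, then replace a uniformly
-- chosen slot with element i with probability k/i.
reservoirK :: Int -> [a] -> IO [a]
reservoirK k xs = go (zip [k + 1 ..] rest) seed
  where
    (seed, rest) = splitAt k xs
    go [] res = return res
    go ((i, x) : more) res = do
        j <- randomRIO (1, i)  -- uniform in [1 .. i]
        go more (if j <= k then replaceAt (j - 1) x res else res)
    replaceAt n x ys = take n ys ++ [x] ++ drop (n + 1) ys

main :: IO ()
main = do
    r <- reservoirK 3 [1 .. 3 :: Int]  -- stream no longer than k: kept whole
    print r                            -- prints [1,2,3]
    s <- reservoirK 3 [1 .. 100]
    print (length s)                   -- the reservoir stays at size 3
```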