I'm trying to measure the performance of a simple Haar DWT program using the Criterion framework. (It is erroneously slow, but I'll leave that for another question). I can't find any good documentation on the web, unfortunately. My two primary problems are
This source is relatively pared down; the first function getRandList
generates a list of random numbers; haarStep
transforms an input signal into differences and sums, and haarDWT
calls the former and recurses on the sums. I'm trying to pass the getRandList
to the haarDWT
via lazy evaluation, but perhaps my usage is incorrect / unsupported. The timings don't seem to make sense.
{-# LANGUAGE ViewPatterns #-}
import Control.Arrow
import qualified Data.Vector.Unboxed as V
import System.Random
import Criterion.Main
invSqrt2 = 0.70710678118654752440
getRandList :: RandomGen g => g -> Int -> [Float]
getRandList gen 0 = []
getRandList gen n = v:rest where
(v, gen') = random gen
rest = getRandList gen' (n - 1)
haarStep :: V.Vector Float -> (V.Vector Float, V.Vector Float)
haarStep = (alternatingOp (-) &&& alternatingOp (+)) where
alternatingOp op x = V.generate (V.length x `div` 2) (\i ->
((x V.! (2 * i)) `op` (x V.! (2 * i + 1))) * invSqrt2)
haarDWT :: V.Vector Float -> V.Vector Float
haarDWT xl@(V.length -> 1) = xl
haarDWT (haarStep -> (d, s)) = haarDWT s V.++ d
main = do
gen <- getStdGen
inData <- return $ getRandList gen 2097152
outData <- return $ haarDWT (V.fromList inData)
defaultMain [
bench "get input" $ nf id inData,
bench "transform" $ nf V.toList outData
]
writeFile "input.dat" (unlines $ map show inData)
writeFile "output.dat" (unlines $ map show $ V.toList outData)
Finally, I'm getting an error when I try to call it with -s 1
; maybe this is just a Criterion bug.
Main: ./Data/Vector/Generic.hs:237 ((!)): index out of bounds (1,1)
Thanks in advance!
The posted benchmark is erroniously slow... or is it
Are you sure it's erroneous? You're touching (well, the "nf" call is touching) 2 million boxed elements - thats 4 million pointers. You can call this erroneous if you want, but the issue is just what you think you're measure compared to what you really are measuring.
Sharing Data Between Benchmarks
Data sharing can be accomplished through partial application. In my benchmarks I commonly have
let var = somethingCommon in
defaultMain [ bench "one" (nf (func1 somethingCommon) input1)
, bench "two" (nf (func2 somethingCommon) input2)]
Avoiding Reuse in the presences of lazy evaluation
Criterion avoids sharing by separating out your function and your input. You have signatures such as:
funcToBenchmark :: (NFData b) => a -> b
inputForFunc :: a
In Haskell every time you apply funcToBenchmark inputForFunc
it will create a thunk that needs evaluated. There is no sharing unless you use the same variable name as a previous computation. There is no automatic memoization - this seems to be a common misunderstanding.
Notice the nuance in what isn't shared. We aren't sharing the final result, but the input is shared. If the generation of the input is what you want to benchmark (i.e. getRandList, in this case) then benchmark that and not just the identity + nf function:
main = do
gen <- getStdGen
let inData = getRandList gen size
inVec = V.fromList inData
size = 2097152
defaultMain
[ bench "get input for real" $ nf (getRandList gen) size
, bench "get input for real and run harrDWT and listify a vector" $ nf (V.toList . haarDWT . V.fromList . getRandList gen) size
, bench "screw generation, how fast is haarDWT" $ whnf haarDWT inVec] -- for unboxed vectors whnf is sufficient
Interpreting Data
The third benchmark is rather instructive. Lets look at what criterion prints out:
benchmarking screw generation, how fast is haarDWT
collecting 100 samples, 1 iterations each, in estimated 137.3525 s
bootstrapping with 100000 resamples
mean: 134.7204 ms, lb 134.5117 ms, ub 135.0135 ms, ci 0.950
Based on a single run, Criterion thinks it will take 137 seconds to perform it's 100 samples. About ten seconds later it was done - what happened? Well, the first run forced all the inputs (inVec
), which was expensive. The subsequent runs found a value instead of a thunk, and thus we truely benchmarked haarDWT
and not the StdGen
RNG (which is known to be painfully slow).