haskell, benchmarking, haskell-criterion

Why does a minor change to a function's design radically change the result of a criterion benchmark?


I have two source files that do roughly the same thing. The only difference is that in the first case the generating function is passed as the benchmark parameter, and in the second case a plain value is.

First case:

module Main where

import qualified Data.Vector.Unboxed as UB
import qualified Data.Vector as V

import Criterion.Main

regularVectorGenerator :: (Int -> t) -> V.Vector t
regularVectorGenerator = V.generate 99999

unboxedVectorGenerator :: UB.Unbox t => (Int -> t) -> UB.Vector t
unboxedVectorGenerator = UB.generate 99999

main :: IO ()
main = defaultMain
    [
        bench "boxed"   $ whnf regularVectorGenerator (+2137)
      , bench "unboxed" $ whnf unboxedVectorGenerator (+2137)
    ]

Second case:

module Main where

import qualified Data.Vector.Unboxed as UB
import qualified Data.Vector as V

import Criterion.Main

regularVectorGenerator :: Int -> V.Vector Int
regularVectorGenerator = flip V.generate (+2137)

unboxedVectorGenerator :: Int -> UB.Vector Int
unboxedVectorGenerator = flip UB.generate (+2137)

main :: IO ()
main = defaultMain
    [
        bench "boxed"   $ whnf regularVectorGenerator 99999
      , bench "unboxed" $ whnf unboxedVectorGenerator 99999
    ]

What I noticed while benchmarking is that the allocated size of the unboxed vector is, as expected, always smaller than that of the boxed one, yet the allocated sizes of both vectors differ drastically between the two cases. Here is the output of the

first case:

 benchmarking boxed
 time                 7.626 ms   (7.515 ms .. 7.738 ms)
                     0.999 R²   (0.998 R² .. 0.999 R²)
 mean                 7.532 ms   (7.472 ms .. 7.583 ms)
 std dev              164.3 μs   (133.8 μs .. 201.3 μs)
 allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
   iters              **1.680e7**    (1.680e7 .. 1.680e7)
   y                  2357.390   (1556.690 .. 3422.724)

 benchmarking unboxed
 time                 889.1 μs   (878.9 μs .. 901.8 μs)
                     0.998 R²   (0.995 R² .. 0.999 R²)
 mean                 868.6 μs   (858.6 μs .. 882.6 μs)
 std dev              39.05 μs   (28.30 μs .. 57.02 μs)
 allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
   iters              **4000009.003** (4000003.843 .. 4000014.143)
   y                  2507.089   (2025.196 .. 3035.962)
 variance introduced by outliers: 36% (moderately inflated)

and the second case:

 benchmarking boxed
 time                 1.366 ms   (1.357 ms .. 1.379 ms)
                     0.999 R²   (0.998 R² .. 1.000 R²)
 mean                 1.350 ms   (1.343 ms .. 1.361 ms)
 std dev              29.96 μs   (21.74 μs .. 43.56 μs)
 allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
   iters              **2400818.350** (2400810.284 .. 2400826.685)
   y                  2423.216   (1910.901 .. 3008.024)
 variance introduced by outliers: 12% (moderately inflated)

 benchmarking unboxed
 time                 61.30 μs   (61.24 μs .. 61.37 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
 mean                 61.29 μs   (61.25 μs .. 61.33 μs)
 std dev              122.1 ns   (91.64 ns .. 173.9 ns)
 allocated:           1.000 R²   (1.000 R² .. 1.000 R²)
   iters              **800040.029** (800039.745 .. 800040.354)
   y                  2553.830   (2264.684 .. 2865.637)

The benchmarked size of the vector decreased by an order of magnitude just by de-parametrizing the function. Can someone explain why?

I compiled both examples with these flags:

-O2 -rtsopts

and launched with

--regress allocated:iters +RTS -T
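
(With Main.hs standing in for the actual file name, which is not given above, the full build-and-run sequence therefore amounts to:)

ghc -O2 -rtsopts Main.hs
./Main --regress allocated:iters +RTS -T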


Solution

  • The difference is that if the generating function is already known inside the benchmarked function, the generator is inlined and the Ints involved are unboxed as well. If the generating function is the benchmark parameter, it cannot be inlined: every element then has to be produced by calling an unknown Int -> Int closure, so each argument and result is a heap-allocated boxed Int.

    From the benchmarking perspective the second version is the correct one, since in normal usage we want the generating function to be inlined; see the sketch below.
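
    To see the effect side by side, here is a minimal sketch (the names opaque and known are illustrative, not from the question) that puts both variants of the unboxed benchmark into one program:

    module Main where

    import qualified Data.Vector.Unboxed as UB

    import Criterion.Main

    -- The generating function arrives as a run-time parameter. UB.generate
    -- can still be inlined here, but the element function stays unknown,
    -- so every element is computed through a call to a boxed Int -> Int
    -- closure, allocating a boxed Int per element.
    opaque :: (Int -> Int) -> UB.Vector Int
    opaque = UB.generate 99999
    {-# NOINLINE opaque #-}  -- belt and braces: keep the two variants separate

    -- The generating function is statically known, so after inlining the
    -- whole fill loop runs on unboxed Int# values with no per-element
    -- allocation.
    known :: Int -> UB.Vector Int
    known n = UB.generate n (+2137)

    main :: IO ()
    main = defaultMain
      [ bench "opaque" $ whnf opaque (+2137)
      , bench "known"  $ whnf known 99999
      ]

    Compiled with -O2, "known" should reproduce the fast numbers and "opaque" the slow ones, and compiling with -ddump-simpl -dsuppress-all makes the difference visible in the generated Core. The allocation regression supports this reading: the fast unboxed run reports roughly 800040 bytes per iteration, which is almost exactly the raw payload of the vector (99999 × 8 bytes = 799992, plus a small constant overhead), while the slower runs allocate several times that, i.e. extra boxed values per element.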