I'm trying to optimize some code, using criterion to compare, for example, the effect of adding an INLINE pragma to a function. But I'm finding the results are not consistent across recompiles/runs.
I need to know either how to get results that are consistent across runs so that I can compare them, or how to assess whether a benchmark is reliable, i.e. (I guess) how to interpret the details about variance, the "cost of a clock call", etc.
This is orthogonal to my main questions above, but a couple of things might be causing inconsistency in my particular case:

- I'm trying to benchmark IO actions using whnfIO, because the method using whnf in this example didn't work (see the sketch after this list)
- my code uses concurrency
- I've got a lot of tabs and crap open
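For reference, the benchmark is shaped roughly like this (a simplified sketch assuming a criterion version that exports whnfIO; insertAndQuery is just a placeholder for my real concurrent IO action):

    import Criterion.Main

    -- Placeholder for the real concurrent IO action being measured.
    insertAndQuery :: Int -> Int -> IO Int
    insertAndQuery inserts queries = return (inserts + queries)

    main :: IO ()
    main = defaultMain
      [ bgroup "actors"
          [ -- whnfIO runs the IO action and evaluates its result to WHNF on each iteration
            bench "insert 1000, query 1000"   $ whnfIO (insertAndQuery 1000 1000)
          , bench "insert 1000, query 100000" $ whnfIO (insertAndQuery 1000 100000)
          ]
      ]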
Both of these runs are from the same code, compiled in exactly the same way. The first run is shown directly below; I then made a change and benchmarked again, then reverted and ran the first code again, compiling with:
ghc --make -fforce-recomp -threaded -O2 Benchmark.hs
First run:
estimating clock resolution...
mean is 16.97297 us (40001 iterations)
found 6222 outliers among 39999 samples (15.6%)
6055 (15.1%) high severe
estimating cost of a clock call...
mean is 1.838749 us (49 iterations)
found 8 outliers among 49 samples (16.3%)
3 (6.1%) high mild
5 (10.2%) high severe
benchmarking actors/insert 1000, query 1000
collecting 100 samples, 1 iterations each, in estimated 12.66122 s
mean: 110.8566 ms, lb 108.4353 ms, ub 113.6627 ms, ci 0.950
std dev: 13.41726 ms, lb 11.58487 ms, ub 16.25262 ms, ci 0.950
found 2 outliers among 100 samples (2.0%)
2 (2.0%) high mild
variance introduced by outliers: 85.211%
variance is severely inflated by outliers
benchmarking actors/insert 1000, query 100000
collecting 100 samples, 1 iterations each, in estimated 945.5325 s
mean: 9.319406 s, lb 9.152310 s, ub 9.412688 s, ci 0.950
std dev: 624.8493 ms, lb 385.4364 ms, ub 956.7049 ms, ci 0.950
found 6 outliers among 100 samples (6.0%)
3 (3.0%) low severe
1 (1.0%) high severe
variance introduced by outliers: 62.576%
variance is severely inflated by outliers
Second run, ~3x slower:
estimating clock resolution...
mean is 51.46815 us (10001 iterations)
found 203 outliers among 9999 samples (2.0%)
117 (1.2%) high severe
estimating cost of a clock call...
mean is 4.615408 us (18 iterations)
found 4 outliers among 18 samples (22.2%)
4 (22.2%) high severe
benchmarking actors/insert 1000, query 1000
collecting 100 samples, 1 iterations each, in estimated 38.39478 s
mean: 302.4651 ms, lb 295.9046 ms, ub 309.5958 ms, ci 0.950
std dev: 35.12913 ms, lb 31.35431 ms, ub 42.20590 ms, ci 0.950
found 1 outliers among 100 samples (1.0%)
variance introduced by outliers: 84.163%
variance is severely inflated by outliers
benchmarking actors/insert 1000, query 100000
collecting 100 samples, 1 iterations each, in estimated 2644.987 s
mean: 27.71277 s, lb 26.95914 s, ub 28.97871 s, ci 0.950
std dev: 4.893489 s, lb 3.373838 s, ub 7.302145 s, ci 0.950
found 21 outliers among 100 samples (21.0%)
4 (4.0%) low severe
3 (3.0%) low mild
3 (3.0%) high mild
11 (11.0%) high severe
variance introduced by outliers: 92.567%
variance is severely inflated by outliers
I notice that if I scale by the "estimated cost of a clock call" the two benchmarks are fairly close. Is that what I should do to get a real number for comparison?
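For example, taking the smaller benchmark: 110.8566 ms / 1.838749 us ≈ 60,300 clock-call costs in the first run, versus 302.4651 ms / 4.615408 us ≈ 65,500 in the second, i.e. within about 9% of each other, whereas the raw means differ by roughly 3x.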
Although there's certainly not enough information here to pinpoint every issue, I have a few suggestions that may help.
The problem with the samples identified as outliers is that criterion can't really tell if they're outliers because they're junk data, or if they're valid data that's different for some legitimate reason. It can strongly hint that they're junk (the "variance is severely inflated" line), but what this really means is that you need to investigate your testing environment, your tests, or your application itself to determine the source of the outliers. In this case it's almost certainly caused by system load (based on other information you've provided).
You might be interested to read BOS's announcement of criterion, which explains how it works in quite a bit more detail and goes through some examples of exactly how system load affects the benchmarking process.
I'm very suspicious of the difference in the "estimated cost of a clock call". Notice that there is a high proportion of outliers (in both runs), and those outliers have a "high severe" impact. I would interpret this to mean that the clock timings criterion picked up are junk (probably in both runs), making everything else unreliable too. As @DanielFischer suggests, closing other applications may help this problem. Worst case might be a hardware problem. If you close all other applications and the clock timings are still unreliable, you may want to test on another system.
If you're running multiple tests on the same system, the clock timings and cost should be fairly consistent from run to run. If they aren't, something is affecting the timings, resulting in unreliable data.
Aside from that, here are two random ideas that may be factors.
1. The threaded runtime can be sensitive to CPU load. The default RTS values work well for many applications unless your system is under heavy load. The problem is that there are a few critical sections in the garbage collector, so if the Haskell runtime is resource starved (because it's competing for CPU or memory with other applications), all progress can be blocked waiting for those sections. I've seen this affect performance by a factor of 2.5, which is more or less in line with the three-fold difference you see.
Even if you don't have issues with the garbage collector, high CPU load from other applications will skew your results and should be eliminated if possible.
how to diagnose

- run top or other system utilities to check CPU load
- run with +RTS -s. At the bottom of the statistics, look for these lines:

    gc_alloc_block_sync: 0
    whitehole_spin: 0
    gen[0].sync: 0
    gen[1].sync: 0

Non-zero values indicate resource contention in the garbage collector. Large values here indicate a serious problem.
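For example, building with -rtsopts (so the binary accepts RTS flags at run time) and then running with +RTS -s will print those statistics; the executable name matches the compile line from the question:

    ghc --make -fforce-recomp -threaded -rtsopts -O2 Benchmark.hs
    ./Benchmark +RTS -s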
how to fix

- leave a core free for other processes by running with fewer capabilities than you have cores (e.g. +RTS -N6 or +RTS -N7 on an 8-core box)
- disable the parallel garbage collector (+RTS -qg). I've usually had better results by leaving a free core than disabling the parallel collector, but YMMV.
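For example, on an 8-core box (again assuming a binary built with -rtsopts), the first line below leaves one core free and the second uses all cores but turns off the parallel collector:

    ./Benchmark +RTS -N7
    ./Benchmark +RTS -N8 -qg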
2. If the functions you're benchmarking are doing any sort of I/O (disk, network, etc.), you need to be very careful in how you interpret the results. Disk I/O can cause huge variances in performance. If you run the same function for 100 samples, after the first run any I/O might be cached by the controller. Or you may have to do a head seek if another file was accessed between runs. Other I/O typically isn't any better.
how to diagnose

- lsof can help track down mysterious I/O performance

how to fix
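As one generic illustration (a sketch only, not tailored to the code above; input.txt and the length . lines workload are hypothetical), you can keep disk reads out of the measured action by reading and forcing the input before handing it to criterion:

    import Control.DeepSeq (force)
    import Control.Exception (evaluate)
    import Criterion.Main

    main :: IO ()
    main = do
      -- Pay the disk I/O (and lazy-I/O) cost here, outside the timed code.
      contents <- readFile "input.txt"
      _ <- evaluate (force contents)
      defaultMain
        [ bench "count lines, in-memory input" $ whnf (length . lines) contents ]

Newer criterion releases also provide env for setting up benchmark inputs outside the measured code.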