Tags: benchmarking, microbenchmark

Idiomatic way of performance evaluation?


I am evaluating a network+rendering workload for my project.

The program continuously runs a main loop:

while (true) {
   doSomething();
   drawSomething();
   doSomething2();
   sendSomething();
}

The main loop runs more than 60 times per second.

I want to see the performance breakdown: how much time each procedure takes.

My concern is that if I print the time interval at every entrance and exit of each procedure, it will incur a huge performance overhead.

I am curious what an idiomatic way of measuring the performance would be.

Is printing or logging good enough?


Solution

  • Generally: for repeated short things, you can just time the whole repeat loop. (But microbenchmarking is hard; it's easy to distort results unless you understand the implications of doing that. For very short things, throughput and latency are different, so measure them separately by making one iteration use the result of the previous one, or not. Also beware that branch prediction and caching can make something look fast in a microbenchmark when it would actually be costly if done one at a time between other work in a larger program. e.g. loop unrolling and lookup tables often look good because there's no pressure on the I-cache or D-cache from anything else.)
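    As a concrete illustration of timing the whole repeat loop, and of the throughput-vs-latency distinction, here is a minimal C++/std::chrono sketch; the multiply-hash `step` function is just a made-up stand-in for whatever short operation you care about, not anything from the question:

        #include <chrono>
        #include <cstdio>

        // Hypothetical stand-in for a short operation under test.
        static inline unsigned step(unsigned x) { return x * 2654435761u + 12345u; }

        int main() {
            constexpr int N = 10'000'000;
            using sclock = std::chrono::steady_clock;

            // Throughput: iterations are independent, so the CPU can overlap them.
            auto t0 = sclock::now();
            unsigned sum = 0;
            for (int i = 0; i < N; ++i)
                sum += step(static_cast<unsigned>(i));
            auto t1 = sclock::now();

            // Latency: each iteration consumes the previous result,
            // so the dependency chain serializes the work.
            unsigned acc = 1;
            for (int i = 0; i < N; ++i)
                acc = step(acc);
            auto t2 = sclock::now();

            std::chrono::duration<double, std::nano> tput = t1 - t0, lat = t2 - t1;
            // Printing the results also keeps the compiler from discarding the loops.
            std::printf("throughput: %.2f ns/iter (sum=%u)\n", tput.count() / N, sum);
            std::printf("latency:    %.2f ns/iter (acc=%u)\n", lat.count() / N, acc);
        }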

    Or if you insist on timing each separate iteration, record the results in an array and print later; you don't want to invoke heavy-weight printing code inside your loop.
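    For example, a sketch of that record-now, print-later idea in C++ with std::chrono, with the question's procedures reduced to empty stubs and the endless loop bounded so the program terminates:

        #include <array>
        #include <chrono>
        #include <cstdio>
        #include <vector>

        using sclock = std::chrono::steady_clock;

        // Stubs standing in for the question's procedures.
        void doSomething()   {}
        void drawSomething() {}
        void doSomething2()  {}
        void sendSomething() {}

        int main() {
            constexpr int kFrames = 10'000;
            // Pre-allocate so the measured loop never allocates or prints.
            std::vector<std::array<sclock::time_point, 5>> marks(kFrames);

            for (int f = 0; f < kFrames; ++f) {
                auto& m = marks[f];
                m[0] = sclock::now();
                doSomething();
                m[1] = sclock::now();
                drawSomething();
                m[2] = sclock::now();
                doSomething2();
                m[3] = sclock::now();
                sendSomething();
                m[4] = sclock::now();
            }

            // All the heavy-weight printing happens after the loop.
            const char* names[4] = {"doSomething", "drawSomething",
                                    "doSomething2", "sendSomething"};
            for (int p = 0; p < 4; ++p) {
                double total_ns = 0;
                for (int f = 0; f < kFrames; ++f)
                    total_ns += std::chrono::duration<double, std::nano>(
                                    marks[f][p + 1] - marks[f][p]).count();
                std::printf("%-14s %8.2f us/frame on average\n",
                            names[p], total_ns / kFrames / 1000.0);
            }
        }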

    This question is way too broad to say anything more specific.

    Many languages have benchmarking packages that will help you write microbenchmarks of a single function. Use them. e.g. for Java, JMH makes sure the function under test is warmed up and fully optimized by the JIT, and all that jazz, before doing timed runs. It also runs the test for a specified interval, counting how many iterations it completes. See How do I write a correct micro-benchmark in Java? for that and more.
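    If your project is C or C++ rather than Java, Google Benchmark fills a similar role: the framework, not your code, handles the warm-up and the iteration counting. A rough sketch of typical usage (the function under test is made up):

        #include <benchmark/benchmark.h>

        // Hypothetical function under test.
        static unsigned step(unsigned x) { return x * 2654435761u + 12345u; }

        static void BM_Step(benchmark::State& state) {
            unsigned x = 1;
            for (auto _ : state) {               // the framework picks the iteration count
                x = step(x);
                benchmark::DoNotOptimize(x);     // keep the result "used" so it isn't deleted
            }
        }
        BENCHMARK(BM_Step);

        BENCHMARK_MAIN();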


    Beware common microbenchmark pitfalls

    Related to that last point: Don't tune only for huge inputs, if the real use-case for a function includes a lot of small inputs. e.g. a memcpy implementation that's great for huge inputs but takes too long to figure out which strategy to use for small inputs might not be good. It's a tradeoff; make sure it's good enough for large inputs (for an appropriate definition of "enough"), but also keep overhead low for small inputs.
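    One way to keep yourself honest about that with a framework like Google Benchmark is to parameterize the input size instead of hand-picking one huge size. A sketch, with memcpy chosen to mirror the example above:

        #include <benchmark/benchmark.h>
        #include <cstddef>
        #include <cstdint>
        #include <cstring>
        #include <vector>

        static void BM_memcpy(benchmark::State& state) {
            const std::size_t n = static_cast<std::size_t>(state.range(0));
            std::vector<char> src(n, 'x'), dst(n);
            for (auto _ : state) {
                std::memcpy(dst.data(), src.data(), n);
                benchmark::DoNotOptimize(dst.data());   // the copy has an observer
                benchmark::ClobberMemory();             // and the stores must really happen
            }
            state.SetBytesProcessed(static_cast<int64_t>(state.iterations()) * n);
        }
        // Sweep small and large sizes, not just the huge ones.
        BENCHMARK(BM_memcpy)->Range(8, 8 << 20);

        BENCHMARK_MAIN();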

    Litmus tests:


    For C / C++, see also Simple for() loop benchmark takes the same time with any loop bound, where I went into some more detail about microbenchmarking, and about using volatile or asm to stop important work from being optimized away by gcc/clang.
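    The core of that trick, for gcc/clang, is an empty asm statement that tells the compiler a value is read (and maybe modified), so the work that produced it can't be deleted. Roughly like this; the helper name is made up, and it's essentially what benchmark::DoNotOptimize does:

        #include <chrono>
        #include <cstdio>

        // Empty asm: gcc/clang must assume `value` is used and possibly changed,
        // so the computation feeding it can't be optimized away, yet no extra
        // instructions are emitted.
        static inline void do_not_optimize(unsigned& value) {
            asm volatile("" : "+r"(value) : : "memory");
        }

        int main() {
            constexpr int N = 100'000'000;
            auto t0 = std::chrono::steady_clock::now();
            unsigned x = 1;
            for (int i = 0; i < N; ++i) {
                x = x * 2654435761u + 12345u;   // work we don't want the compiler to delete
                do_not_optimize(x);
            }
            std::chrono::duration<double, std::nano> dt =
                std::chrono::steady_clock::now() - t0;
            std::printf("%.2f ns per iteration (x=%u)\n", dt.count() / N, x);
        }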