I need to iterate over lots of (2D) data and only sometimes handle special cases. For my application speed is the most critical factor.
The options that quickly come to (my) mind are:
Option A:
void ifInLoop(bool specialCase, MyClass &acc) {
  for (auto i = 0; i < n; ++i) {
    for (auto j = 0; j < n; ++j) {
      if (specialCase) {
        acc.foo();
      } else {
        acc.bar();
      }
    }
  }
}
Option B:
void loopsInIf(bool specialCase, MyClass &acc) {
  if (specialCase) {
    for (auto i = 0; i < n; ++i) {
      for (auto j = 0; j < n; ++j) {
        acc.foo();
      }
    }
  } else {
    for (auto i = 0; i < n; ++i) {
      for (auto j = 0; j < n; ++j) {
        acc.bar();
      }
    }
  }
}
Option C:
template <bool specialCase>
void templateIf(MyClass &acc) {
  for (auto i = 0; i < n; ++i) {
    for (auto j = 0; j < n; ++j) {
      if (specialCase) {
        acc.foo();
      } else {
        acc.bar();
      }
    }
  }
}
I know this falls under premature optimization. However, from a theoretical point of view, I'm interested in how these snippets differ in the produced assembly and in speed when compiled with -O3 (GCC / Clang).
(There is already a similar question about this for Perl, but I would like to know about C++ specifically.)
(EDIT) Is specialCase known at compile time?
Not really. The call itself is in another loop, and some iterations are treated differently. So it's something like the following (not necessarily equidistant, but independent of user input):
for (int i = 0; i < m; ++i) {
  ifInLoop(i % 10, acc);
}
How would I use Option C here? It introduces an extra if, so I would expect it to behave very similarly to Option B.
for (int i = 0; i < m; ++i) {
  if (i % 10)
    templateIf<true>(acc);
  else
    templateIf<false>(acc);
}
If this function can inline into a caller that passes a compile-time-constant bool, you're fine with option A (as long as the function is small enough to inline). i.e. if a template arg is possible, you don't usually actually need it, except when you write if(var) { foo<true>(arg); } else { foo<false>(arg); } to encourage the compiler to make asm with 2 versions of the loop.
All modern compilers are smart enough to inline small functions and then completely optimize away an if(constant). Inlining + constant-propagation are what make modern C++ possible to compile efficiently.
But if the bool value is not known at compile time, option B is likely more efficient. (If the function doesn't run often, its speed might not matter in the big picture, and the difference might be small.)
It's a tradeoff between static code size (I-cache footprint) vs. dynamic instruction count. Or if the special case rarely runs, that version of the loop might stay cold in cache.
for (int i = 0; i < m; ++i) { ifInLoop(i % 10, acc); }
If you really do have a repeating pattern like this, the compiler might decide to unroll this loop for you so the bool becomes a compile-time constant.
Or you could hand-hold the compiler into maybe making nicer asm if it doesn't decide to invent new inner loops itself; an unroll by 10 of a loop containing another whole loop may be too much for the compiler's heuristics.
int i;
for (i = 0; i < m - 9; i += 10) {  // potentially runs zero times if signed m <= 9
    ifInLoop(false, acc);          // this is the j=0
    for (int j = 1; j < 10; j++)   // j = i%10
        ifInLoop(true, acc);       // original i = i+j in case you need it
}
// cleanup loop:
for ( ; i < m; i++) {
    ifInLoop(i % 10, acc);
}
Predicting perfectly doesn't get rid of the front-end + back-end throughput cost of the instructions that check the branch condition, if the compiler doesn't hoist the if and generate two versions of the loop.
There could be significant simplifications / optimizations possible if the compiler knows that only one of the if or else bodies runs every iteration, but checking at runtime and branching misses out on those optimizations even if it predicts perfectly.
The usual Stack Overflow response of "profile it" is not as useful as most people seem to think. First of all, microbenchmarking is hard. It's very easy to measure the wrong thing entirely, or draw nonsense conclusions because you don't know enough about what could possibly matter and what doesn't. (Make sure to warm up your CPU to max turbo frequency and init the memory first so you don't have CoW mapping to the zero page and the first timed pass isn't paying page-fault + TLB-miss costs. Compile with optimization enabled, and check that performance scales linearly with your repeat count.)
Profiling one test case won't tell you the cost in general. Which optimizations you miss out on, and whether the compiler is willing to split the loop for you and hoist the branch, depend on the details of the loop (perhaps including how complex the loop body is).
Looking at the asm for your specific case with the compilers you care about is the only way to be sure.
Different compilers (or different versions of the same compiler, or different tuning options like gcc -mtune=generic vs. gcc -mtune=skylake) can certainly make a difference in whether the compiler decides to invert/split the loop to select once between two loops. Tune options set heuristic constants for decisions like this and for loop unrolling, where there's a tradeoff between static code-size and dynamic instruction-count.
Part of that may depend on how much work is outside the if() and would have to get duplicated unchanged when splitting the loop.