Tags: c, optimization, caching, cpu, micro-optimization

Use two loop bodies or one (result identical)?


I have long wondered which makes better use of CPU caches (which are known to benefit from locality of reference): two loops, each iterating over the same range of numbers but with a different body statement (e.g. a call to a function for each element), or one loop whose body does the equivalent of both (or more) body statements. Assume the application state is identical after all the looping.

In my opinion, having two loops would cause fewer cache misses and evictions, because more of the instructions and data used by each loop would fit in the cache. Am I right?

Assuming:

  1. The cost of calling f and g is negligible compared to the cost of the loop

  2. f and g each use most of the cache on their own, so cached data would be evicted whenever one is called right after the other (as happens in the single-loop version)

  3. Intel Core Duo CPU

  4. C language source code

  5. The GCC compiler, "no extra switches"

I would prefer answers that go beyond "premature optimization is evil", if possible.

An example of the two-loops version that I am advocating for:

int j = 0, k = 0;

for(int i = 0; i < 1000000; i++) {
    j += f(i);
}

for(int i = 0; i < 1000000; i++) {
    k += g(i);
}
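
For comparison, the single-loop alternative could be sketched as below, next to the two-loop version. The bodies of f and g here are hypothetical placeholders (the real ones are not shown), chosen only so that the sums stay within int range; two_loops, one_loop and check_equivalent are illustrative names. The check confirms the "identical state" assumption and says nothing about cache behavior.

```c
#include <assert.h>

/* Hypothetical stand-ins for f and g; the real functions are not shown. */
static int f(int i) { return i & 0xFF; }
static int g(int i) { return i % 7; }

/* Two-loop version: all f calls first, then all g calls. */
static void two_loops(int n, int *j, int *k) {
    *j = 0;
    *k = 0;
    for (int i = 0; i < n; i++) {
        *j += f(i);
    }
    for (int i = 0; i < n; i++) {
        *k += g(i);
    }
}

/* Single-loop version: f and g alternate on every iteration. */
static void one_loop(int n, int *j, int *k) {
    *j = 0;
    *k = 0;
    for (int i = 0; i < n; i++) {
        *j += f(i);
        *k += g(i);
    }
}

/* Asserts that both versions leave j and k in the same state. */
static void check_equivalent(int n) {
    int j1, k1, j2, k2;
    two_loops(n, &j1, &k1);
    one_loop(n, &j2, &k2);
    assert(j1 == j2 && k1 == k2);
}
```

Calling check_equivalent(1000000) exercises the same iteration count as the example above.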

Solution

  • I can see three variables (even in a seemingly simple chunk of code):

    A final thought: given that loops like the ones above may be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both of your functions inline and letting the compiler unroll the loop. That is because, for the instruction cache, falling back to L2 is no big deal, and the probability that the single cache line containing i, j and k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
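
A minimal sketch of the inline suggestion, assuming f and g are small enough to be worth inlining. The function bodies and the names f_inline, g_inline and run_fused are hypothetical, not taken from the question. Note that GCC does not inline functions when optimization is off (short of an always_inline attribute), so with "no extra switches" the calls would remain; something like -O2 is needed before inlining and unrolling can happen at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bodies; static inline lets the compiler expand them in place
   when optimization is enabled. */
static inline int f_inline(int i) { return i & 0xFF; }
static inline int g_inline(int i) { return i % 7; }

/* Fused loop over [0, n). i, j and k are locals, so an optimizing compiler
   can keep them in registers or on a single stack cache line. */
void run_fused(int n, int *j_out, int *k_out) {
    assert(j_out != NULL && k_out != NULL);
    int j = 0, k = 0;
    for (int i = 0; i < n; i++) {
        j += f_inline(i);
        k += g_inline(i);
    }
    *j_out = j;
    *k_out = k;
}
```

Whether this beats the two-loop version depends on how much code and data the real f and g touch, so it is worth timing both with the actual functions.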