I have long wondered which is more efficient with regard to making better use of CPU caches (which are known to benefit from locality of reference): two loops, each iterating over the same mathematical set of numbers and each with a different body statement (e.g. a call to a different function for each element of the set), or one loop whose body does the equivalent of the two (or more) body statements. Assume the application state is identical after all the looping.
In my opinion, having two loops would introduce fewer cache misses and evictions, because more of the instructions and data used by each loop would fit in the cache. Am I right?
Assuming:
- Cost of an f or g call is negligible compared to the cost of the loop itself
- f and g each use most of the cache by themselves, so the cache would be spilled when one is called after the other (the case with the single-loop version)
- Intel Core Duo CPU
- C language source code
- The GCC compiler, with no extra switches
I would like answers that go beyond "premature optimization is evil", if possible.
An example of the two-loop version that I am advocating:
int j = 0, k = 0;
for (int i = 0; i < 1000000; i++) {
    j += f(i);
}
for (int i = 0; i < 1000000; i++) {
    k += g(i);
}
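For comparison, the single-loop version I am weighing this against would look like the following sketch (same f and g as above, and the same final values of j and k):

int j = 0, k = 0;
for (int i = 0; i < 1000000; i++) {
    j += f(i);  /* same f as above */
    k += g(i);  /* same g as above */
}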
I can see three variables (even in a seemingly simple chunk of code):

- What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in the L2 instruction cache too? If so, keeping only one of them in the cache might be beneficial. Note that the inverse does not imply "use a single loop", because:
- Do f() and g() operate on large amounts of data, depending on i? Then it would be good to know whether they operate on the same set of data - again, you have to consider whether operating on two different sets hurts you through cache or TLB misses.
- If f() and g() really are as primitive as you first state - in code size as well as in running time and code complexity - then cache locality issues won't arise in little chunks of code like this. Your biggest concern would be some other process getting scheduled with actual work to do and invalidating all the caches until it was your process's turn to run.

A final thought: given that such processes might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both of your functions inline and letting the compiler unroll the loop, as in the sketch below. For the instruction cache, faulting back to L2 is no big deal, and the probability that the single cache line containing i, j and k would be invalidated in that loop doesn't look so bad. However, if that's not the case, some more details would be useful.
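To make that last suggestion concrete, here is a minimal sketch; the bodies of f() and g() are made-up placeholders (the question doesn't say what they compute), and the inlining and unrolling only actually happen once optimization is enabled - GCC with no extra switches compiles at -O0 and will generally leave the calls as they are.

/* Placeholder bodies - pure illustration, not the real f()/g(). */
static inline int f(int i) { return i & 1; }
static inline int g(int i) { return i % 3; }

int sum_both(void) {
    int j = 0, k = 0;
    /* One loop over both calls; with the functions inlined (e.g. at -O2)
       the compiler sees a single small body it is free to unroll. */
    for (int i = 0; i < 1000000; i++) {
        j += f(i);
        k += g(i);
    }
    return j + k;
}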