c++, performance, memory-bandwidth, roofline

Roofline model: calculating operational intensity


Say I have a toy loop like this:

float x[N];
float y[N];
for (int i = 1; i < N - 1; i++)
    y[i] = a * (x[i-1] - x[i] + x[i+1]);

And I assume my cache line is 64 bytes (i.e. big enough). Then, per iteration, I will have basically 2 RAM accesses (a 4-byte read of x[i+1] -- the other two loads of x hit the cache -- and a 4-byte write of y[i]) and 3 FLOPs (two additions/subtractions and one multiplication).

The operational intensity is therefore

OI = 3 FLOPs / (2 * 4 bytes) = 0.375 FLOP/byte

Now, what happens if I do something like this:

float x[N];
for (int i = 1; i < N - 1; i++)
    x[i] = a * (x[i-1] - x[i] + x[i+1]);

Note that there is no y anymore. Does this mean I now have a single RAM access per iteration, or still 2 RAM accesses? The operational intensity OI would be different in either case. Can anyone clarify this? Thanks.


Solution

  • Disclaimer: I had never heard of the roofline performance model until today. As far as I can tell, it characterizes an algorithm by its "arithmetic intensity" -- the number of FLOPs per byte of data accessed -- and uses that to put a theoretical bound on attainable performance. Such a measure may be useful for comparing similar algorithms as N grows large, but is not very helpful for predicting real-world performance.
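    (For reference, the bound that gives the model its name is usually written as

        attainable FLOP/s = min(peak FLOP/s, OI * peak memory bandwidth)

    i.e. the OI of a loop determines which ceiling -- compute or memory bandwidth -- it runs into.)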

    As a general rule of thumb, modern processors can execute instructions much faster than they can fetch/store data, and this becomes drastically more pronounced once the working set outgrows the caches. So, contrary to what one might expect, a loop with higher arithmetic intensity may run much faster than one with lower arithmetic intensity; what matters most as N scales is the total amount of data touched. This will hold as long as memory remains significantly slower than the processor, as it is in common desktop and server systems today. A rough sketch of this effect follows below.
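    To make that concrete, here is a minimal sketch (my own illustration, not part of the original answer): both loops below move the same bytes, but the second does 8 times the arithmetic per element. With N well beyond the last-level cache, both tend to be limited by memory bandwidth on typical desktop hardware, so the measured times usually end up far closer than the 8x difference in FLOPs would suggest (compile with optimizations, e.g. -O2):

        #include <chrono>
        #include <cstddef>
        #include <cstdio>
        #include <vector>

        int main() {
            const std::size_t N = std::size_t(1) << 25;  // 32 Mi floats = 128 MiB per array
            std::vector<float> x(N, 1.0f), y(N);

            auto time_it = [](auto&& body) {
                auto t0 = std::chrono::steady_clock::now();
                body();
                auto t1 = std::chrono::steady_clock::now();
                return std::chrono::duration<double>(t1 - t0).count();
            };

            // Low arithmetic intensity: 1 FLOP per element moved.
            double t_low = time_it([&] {
                for (std::size_t i = 0; i < N; i++)
                    y[i] = 2.0f * x[i];
            });

            // Higher arithmetic intensity: 8 FLOPs per element, same memory traffic.
            double t_high = time_it([&] {
                for (std::size_t i = 0; i < N; i++) {
                    float v = x[i];
                    v = v * 1.1f + 0.1f;  // 2 FLOPs
                    v = v * 1.1f + 0.1f;
                    v = v * 1.1f + 0.1f;
                    v = v * 1.1f + 0.1f;
                    y[i] = v;
                }
            });

            std::printf("1 FLOP/elem: %.3fs  8 FLOPs/elem: %.3fs  (y[0] = %f)\n",
                        t_low, t_high, y[0]);
        }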

    In short, x86 CPUs are unfortunately too complex to be accurately described with such a simple model. An access to memory goes through several layers of caching (typically L1, L2, and L3) before hitting RAM. Maybe all your data fits in L1 -- the second time you run your loop(s) there could be no RAM accesses at all.
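    A rough way to observe this (again my own sketch, not from the original answer): time the same small traversal several times in a row. After an eviction pass, the first traversal has to pull the array from RAM, while later ones can be served from cache:

        #include <chrono>
        #include <cstddef>
        #include <cstdio>
        #include <vector>

        int main() {
            const std::size_t N = 1 << 15;            // 32 Ki floats = 128 KiB, fits a typical L2
            std::vector<float> x(N, 1.0f);
            std::vector<float> evict(1 << 24, 0.0f);  // 64 MiB, streamed to push x out of the caches

            float sum = 0.0f;
            for (float v : evict) sum += v;           // evict x, so the first timed pass starts cold

            for (int pass = 0; pass < 3; pass++) {
                auto t0 = std::chrono::steady_clock::now();
                for (std::size_t i = 0; i < N; i++) sum += x[i];
                auto t1 = std::chrono::steady_clock::now();
                std::printf("pass %d: %.1f us\n", pass,
                            std::chrono::duration<double>(t1 - t0).count() * 1e6);
            }
            std::printf("checksum: %f\n", sum);       // keeps the sums from being optimized away
        }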

    And it's not just the data cache: code lives in memory too and has to be loaded into the instruction cache. Every read/write also goes through a virtual address, translated by the hardware TLB (which can, in extreme cases, trigger a page fault and, say, cause the OS to write a page to disk in the middle of your loop). And all of this assumes your program has the hardware to itself; under a non-realtime OS, other processes and threads compete for the same limited resources.

    Finally, the computation itself is not performed (directly) on memory: the data is first loaded into registers, operated on there, and the result is stored back.

    How the compiler allocates registers, whether it attempts loop unrolling or auto-vectorization, its instruction-scheduling model (interleaving instructions to hide the latency of data dependencies), and so on will also affect the actual throughput of the algorithm.
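    If you want to see what the compiler actually did to your loop, mainstream compilers can tell you (the flags below exist in current GCC and Clang; "stencil.cpp" is just a placeholder filename):

        # GCC: report which loops were auto-vectorized
        g++ -O3 -march=native -fopt-info-vec stencil.cpp

        # Clang: remarks from the loop vectorizer
        clang++ -O3 -march=native -Rpass=loop-vectorize stencil.cpp

        # Or read the generated assembly directly
        g++ -O3 -march=native -S stencil.cpp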

    So, finally: depending on the code produced, the CPU model, the amount of data processed, and the state of the various caches, the runtime of the algorithm will vary by orders of magnitude. The effective operational intensity of a loop therefore cannot be determined by inspecting the code (or even the generated assembly) alone, since many other (non-linear) factors are in play.


    To address your actual question, though: as far as I can tell from the definition outlined here, the second loop would count as a single additional 4-byte access per iteration on average, so its OI would be Θ(3N FLOPs / 4N bytes). Intuitively this makes sense: the cache already holds the data from the reads, so the write can update the cache line directly instead of going out to main memory (the line does eventually have to be written back, but that requirement is unchanged from the first loop).
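    Plugging the same counting into both loops (my arithmetic, using the question's 4-byte floats and the traffic estimates above):

        first loop:  OI = 3N FLOPs / (2 * 4N bytes) = 0.375 FLOP/byte
        second loop: OI = 3N FLOPs / (1 * 4N bytes) = 0.75  FLOP/byte

    So, under this definition, dropping y doubles the operational intensity.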