My understanding of the L1 cache was that a memory fetch loads a cache line. Assuming the cache line size is 64 bytes, if I access memory at address p, it will load the entire block from p to p + 64 into the cache. Thus, it is best to iterate through an array from left to right (rather than right to left) to maximize cache locality.
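To make that concrete, here is a rough sketch of that mental model (strictly speaking the line starts at the 64-byte-aligned address containing p rather than at p itself, but the point is that one access pulls in a whole 64-byte block):

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE 64

/* Sketch: an access to addr fills the whole 64-byte line containing it,
 * i.e. the range [line_start, line_start + 64). */
int main(void) {
    uint8_t buf[256];
    uintptr_t addr = (uintptr_t)&buf[100];
    uintptr_t line_start = addr & ~(uintptr_t)(LINE_SIZE - 1); /* align down to 64 */
    printf("access at %p loads line [%p, %p)\n",
           (void *)addr, (void *)line_start, (void *)(line_start + LINE_SIZE));
    return 0;
}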
However, I wrote example C code that allocates an array of 100 million chars, writes random values into it, and sums it (copied below for reference). One version of the code sums from left to right and the other from right to left. When I benchmarked it, I got very similar results (where "clock cycles" are measured in terms of clock()).
. The code was compiled with no optimizations.
So my question is: do modern processors do something other than just "cache the read + 64 bytes"? Do they cache forwards and backwards? Can the compiler "tell" the processor that the code is iterating backwards?
For reference, I am running on Mac OS X 10.13.3, using gcc-7 (Homebrew GCC 7.2.0_1) 7.2.0, on an x86-64 Intel processor with a cache line size of 64 bytes.
Benchmarks:
$ ./a.out
Backward Iterating...took 150101 clock cycles
$ ./a.out
Forward Iterating...took 146545 clock cycles
I would have expected forward iteration to be about 64x faster, since all but one of every 64 accesses should be a cache hit, whereas with backward iteration every access should be a cache miss.
So I ran cachegrind on it. The cache miss rates for both were pretty much the same:
# Left to right iteration
==21773==
==21773== I refs: 4,006,996,067
==21773== I1 misses: 5,183
==21773== LLi misses: 3,019
==21773== I1 miss rate: 0.00%
==21773== LLi miss rate: 0.00%
==21773==
==21773== D refs: 1,802,393,260 (1,401,627,925 rd + 400,765,335 wr)
==21773== D1 misses: 3,153,101 ( 1,588,104 rd + 1,564,997 wr)
==21773== LLd misses: 3,004,885 ( 1,440,660 rd + 1,564,225 wr)
==21773== D1 miss rate: 0.2% ( 0.1% + 0.4% )
==21773== LLd miss rate: 0.2% ( 0.1% + 0.4% )
==21773==
==21773== LL refs: 3,158,284 ( 1,593,287 rd + 1,564,997 wr)
==21773== LL misses: 3,007,904 ( 1,443,679 rd + 1,564,225 wr)
==21773== LL miss rate: 0.1% ( 0.0% + 0.4% )
# Right to left iteration
==21931==
==21931== I refs: 4,006,996,453
==21931== I1 misses: 5,198
==21931== LLi misses: 3,045
==21931== I1 miss rate: 0.00%
==21931== LLi miss rate: 0.00%
==21931==
==21931== D refs: 1,802,393,428 (1,401,628,038 rd + 400,765,390 wr)
==21931== D1 misses: 3,153,113 ( 1,588,079 rd + 1,565,034 wr)
==21931== LLd misses: 3,135,505 ( 1,571,219 rd + 1,564,286 wr)
==21931== D1 miss rate: 0.2% ( 0.1% + 0.4% )
==21931== LLd miss rate: 0.2% ( 0.1% + 0.4% )
==21931==
==21931== LL refs: 3,158,311 ( 1,593,277 rd + 1,565,034 wr)
==21931== LL misses: 3,138,550 ( 1,574,264 rd + 1,564,286 wr)
==21931== LL miss rate: 0.1% ( 0.0% + 0.4% )
Code:
#include <stdint.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE 100000000

int main() {
    srand(time(NULL));
    uint8_t *buf1 = (uint8_t *)malloc(BUF_SIZE);

    // Fill the buf with random data
    for (size_t i = 0; i < BUF_SIZE; ++i) {
        buf1[i] = rand();
    }

#ifdef BACKWARDS
    printf("Backward Iterating...");
#else
    printf("Forward Iterating...");
#endif

    uint64_t sum = 0;
    clock_t start = clock();
#ifdef BACKWARDS
    for (size_t i = BUF_SIZE - 1; i != ~0; --i) {
#else
    for (size_t i = 0; i < BUF_SIZE; ++i) {
#endif
        sum += buf1[i];
    }
    clock_t end = clock();

    printf("took %lu clock cycles\n", end - start);
    printf("sum: %llu\n", sum);
    free(buf1);
}
To expand on the previous answer:
Loading at full cache-line granularity means that going forward or backward doesn't matter: once you hit one side of the line you get all of it. This of course only applies to cacheable loads and memory types (plus streaming cases that could be hit while the data is still in the buffers).
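You can see this in the cachegrind numbers: the buffer spans 100,000,000 / 64 ≈ 1.56 million cache lines, and both runs report roughly 1.59 million D1 read misses, i.e. essentially one miss per line regardless of direction.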
However, this is not the full story. Modern CPUs employ hardware prefetchers that are very good at exploiting spatial locality - these effectively increase the granularity by prefetching additional cache lines in the same direction you're progressing. The exact prefetchers depend on the architecture you're using, but common ones include next-line, adjacent-line (+/- 1 line), a stream of next lines, and IP-based strided prefetchers. More info here.
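To get a feel for what a stream prefetcher buys you, one rough way to emulate it in software is to issue explicit prefetches a few lines ahead of the access stream. A minimal sketch (the 4-line lookahead is an arbitrary choice, and this only illustrates the idea, not what the hardware literally does):

#include <stdint.h>
#include <stddef.h>
#include <xmmintrin.h> /* _mm_prefetch */

/* Sketch: emulate a "next lines" prefetcher by prefetching a few
 * cache lines ahead of the current position while summing. */
uint64_t sum_forward_prefetch(const uint8_t *buf, size_t n) {
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        if (i + 4 * 64 < n) /* stay 4 lines (256 bytes) ahead */
            _mm_prefetch((const char *)&buf[i + 4 * 64], _MM_HINT_T0);
        sum += buf[i];
    }
    return sum;
}

For a simple sequential sweep like the one in the question, the hardware prefetchers usually keep up on their own, which is consistent with the nearly identical timings above.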
These prefetchers should be symmetric, but we don't know for sure (the exact details are not disclosed); they may have different triggering chances or thresholds for different directions.
Another point is that cachegrind is just a cache simulator: it doesn't include effects like prefetching, and it doesn't model your cache exactly (the sizes should be okay, but the replacement policy and other micro-architectural details aren't guaranteed to be the same), so you won't see the full effect there. It's probably better to observe the actual HW behavior using performance counters.
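On Linux that could be, for example, perf stat -e L1-dcache-loads,L1-dcache-load-misses ./a.out, which reports what the real hardware's L1 data cache did; on macOS the analogous counters are available through Instruments.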