caching

How important is optimizing cache usage?


I'm currently learning about the importance of utilising the cache to its fullest.

As far as I understand it: if RAM were your closet, where you store your speedo and flip-flops in the middle of winter, the CACHE would be the chair in your room with all your daily clothes on it, and, just as in real life, it's quicker to get dressed when your clothes are already on the chair.

What I don't understand is how important this cache usage is, and when it should trump other optimisations. Here is an example (I'm not particularly interested in this example specifically; I'd just like to know the order of magnitude of the effect of proper cache use):

Imagine I have a matrix A[N][M], and I'd like to transpose it and copy it into a matrix B[M][N].

I could take this cache-friendly approach which takes extra steps:

 int C[N*M];
 // copy A into a flat buffer, reading and writing sequentially
 for (int i = 0; i < N; i++) {
     for (int j = 0; j < M; j++) {
         C[i*M+j] = A[i][j];
     }
 }
 // write B sequentially, pulling the transposed elements out of the buffer
 for (int i = 0; i < M; i++) {
     for (int j = 0; j < N; j++) {
         B[i][j] = C[j*M+i];
     }
 }

Or this non-cache-friendly approach:

for (int i = 0; i < N; i++) {
  for (int j = 0; j < M; j++) {
    B[j][i] = A[i][j];
  }
}

I read my course-book, but it only says things like "very important" rather than giving an order of magnitude for the effect. I also don't know the order of magnitude of the other optimisations I could test it against, and that's why I need a rule of thumb.


Solution

  • So I'll start by clearing up exactly how important cache is, with some numbers. Most computers don't have a single cache; instead they have multiple levels of cache (which you'll often see referred to as L1, L2, L3...). Each level is a different tradeoff between size and speed: L1 is relatively small but can be accessed in maybe 3 cycles, L2 is slightly larger but further away and might cost maybe 10 cycles, and L3 might cost around 50. Exact timings vary depending on whether any of the caches are shared between multiple cores and, if so, whether the data being accessed is currently in use by another core.

    On modern processors all levels of cache usually sit on the CPU die, meaning they run at the speed of the processor. RAM has to be accessed over the memory bus on the motherboard, which runs significantly slower than the CPU. The size-speed tradeoff still applies here too: caches are usually measured in MB, while RAM is measured in GB. The combined result is that accessing RAM can take 150 cycles or more.
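
    To put that in the order-of-magnitude terms you asked about, here's a rough back-of-the-envelope comparison using the illustrative cycle counts above (real figures vary between processors):

    // 1,000,000 memory accesses, costed with the rough numbers above.
    constexpr long accesses = 1000000;
    constexpr long all_l1   = accesses * 3;     // ~3 million cycles if every access hits L1
    constexpr long all_ram  = accesses * 150;   // ~150 million cycles if every access goes to RAM
    // => roughly a 50x difference on the memory accesses alone, before any
    //    actual computation is counted.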

    There's also another level of caching that isn't usually included in cache hierarchies: paging. You only have so much RAM in a system, but operating systems can run many programs at once that together might need more than that. To give the illusion of having more RAM, the operating system can write some memory out to disk when the program that asked for it isn't currently using it, and reuse the freed RAM for another program that needs it. When the original program tries to access that data again, the operating system has to read it back from disk into RAM, which as you can imagine is orders of magnitude slower again than reading data that's already in RAM.

    So the example you gave is more like the relationship between the L1 and L2 caches; main memory is more like having to get your clothes out of a storage unit across town.

    The CPU has ways of mitigating this, though. Usually it'll have something called a hardware prefetcher, which makes a guess about which data you're likely to access next and preemptively starts reading it into L1 cache. In the best case, if you do end up needing that data it'll already be in L1 cache, or at the very least on its way there, so you don't have to wait as long. Data is also fetched in granules called cache lines (usually 64 bytes on modern systems), so if you access data that's close together it may already be in cache from a previous read.

    Optimising for cache therefore mostly just means keeping data that's usually accessed close together in time close together in space. That way, when you read one piece of data from memory, its neighbours will likely be on the same cache line, and if you're accessing them in a regular pattern (e.g. sequentially) the prefetcher will be better able to guess the next item you're going to read and have it in cache before you need it.
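
    To make that concrete, here's a small sketch of my own (the array size is arbitrary; assume 4-byte ints and 64-byte cache lines): the same 2D array summed in two different orders.

    #include <cstddef>

    constexpr std::size_t N = 4096;
    static int grid[N][N];   // row-major: grid[i][0..N-1] are adjacent in memory

    // Row-by-row: consecutive iterations touch consecutive addresses, so each
    // 64-byte cache line is fully used and the prefetcher sees a simple
    // sequential stream it can stay ahead of.
    long sum_row_major()
    {
        long total = 0;
        for (std::size_t i = 0; i < N; i++)
            for (std::size_t j = 0; j < N; j++)
                total += grid[i][j];
        return total;
    }

    // Column-by-column: consecutive iterations jump N * sizeof(int) bytes, so
    // each access lands on a different cache line and, for an array this large,
    // only one int from every 64-byte line gets used before it's evicted.
    long sum_column_major()
    {
        long total = 0;
        for (std::size_t j = 0; j < N; j++)
            for (std::size_t i = 0; i < N; i++)
                total += grid[i][j];
        return total;
    }

    Both functions do the same work on the same data; the only difference is the order of the accesses.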

    Your example therefore probably isn't that much of a cache optimisation, because you're still accessing the same data, in the same pattern, except now you've added an additional location in between the source and destination. The compiler will probably eliminate the round trip and merge the loops, so you'd end up with something pretty much identical to the original. A better example of cache optimisation would be changing the algorithm that uses the transposed matrix: could that be changed so it accesses the matrix elements sequentially without having to transpose it?
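
    For instance, here's a hypothetical sketch of that idea (the column-sum use case is made up purely for illustration, it isn't from your question). Suppose the transposed matrix was only ever needed so that some later code could read A's columns as rows; that later code can often just walk A in its natural row-major order instead, and the transpose disappears entirely:

    constexpr int N = 1024, M = 512;   // arbitrary sizes
    int A[N][M];

    // Per-column totals of A, computed without ever building a transpose.
    // A is read strictly sequentially, and the M running totals are small
    // enough to stay resident in cache for the whole loop.
    void column_sums(long (&sums)[M])
    {
        for (int j = 0; j < M; j++) sums[j] = 0;
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < M; j++) {
                sums[j] += A[i][j];
            }
        }
    }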

    The main example of cache optimisation that I've seen used as an introduction is a list search. A naive setup might look like this:

    struct list_element
    {
        list_element* next;   // nodes are typically allocated one at a time, so they can end up anywhere in memory
        int key;
        int value;
    };
    int search(int key, list_element* root)
    {
        // walk the list until we find the key or run out of nodes
        while (root != NULL)
        {
            if (root->key == key) return root->value;
            root = root->next;
        }
        return -1;
    }
    

    This is an example of cache-unfriendly code, because the list nodes are allocated individually and generally aren't contiguous in memory. Since there's no pattern to the accesses, the prefetcher will have a pretty bad time trying to guess what you'll access next, and you'll end up suffering frequent cache misses.

    You can improve this by making the list a single contiguous array:

    struct list_element
    {
        int key;
        int value;
    };
    int search(int key, list_element* array, int size)
    {
        // linear sweep over a single contiguous array of elements
        while (size-- != 0)
        {
            if (array->key == key) return array->value;
            ++array;
        }
        return -1;
    }
    

    Because we're now scanning the entire array in a straight line, the prefetcher will likely be able to guess exactly what we're going to access ahead of time, so most of the time the data will already be in cache before we need it. Additionally, because all the elements sit in a single array, each cache line holds multiple elements, so any cache miss brings in several elements at once. In the previous example each line might only have held one element, so we'd risk taking another cache miss on the very next element we access.

    But there's still another improvement to be made. Remember that caches are finite in size, so if we fetch data while the cache is full, some data that was already in there has to be thrown out. In the best case that's data nobody is going to use again, but in the worst case it's data that the function that called us will need immediately after we return. It's therefore usually a good idea to reduce the number of cache lines you need to fetch.

    Looking at our example, we're repeatedly accessing list_element structs, but you might notice that we only actually use half of each struct: the key. We touch the value exactly once, when we've found the matching element. So when we fetch a cache line of list_elements, half of the space on that line is taken up by data we probably won't need. We can therefore do some further cache optimisation by splitting the list_elements into two separate arrays, one for keys and one for values:

    int search(int key, int* key_array, int* value_array, int size)
    {
        // scan only the keys; the matching value is touched exactly once
        while (size-- != 0)
        {
            if (*key_array == key) return *value_array;
            ++key_array;
            ++value_array;
        }
        return -1;
    }
    

    However, this argument about reducing cache line usage has an edge case: what if the whole array of list_elements already fits in a single cache line? In that case we'll now be making two cache line accesses (one for the key array, one for the value array) instead of just one (for both).
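
    To make that edge case concrete, here's a rough sketch of the arithmetic (assuming 64-byte cache lines and 4-byte ints, which is typical but not guaranteed):

    #include <cstddef>
    #include <cstdio>

    struct list_element { int key; int value; };   // 8 bytes on typical platforms

    int main()
    {
        const std::size_t line = 64;   // assumed cache line size
        // Array-of-structs: 64 / 8 = 8 whole elements fit on a single line,
        // so a list of 8 elements costs one line fetch.
        std::printf("elements per line (key+value structs): %zu\n", line / sizeof(list_element));
        // Split into separate key and value arrays, the same 8-element list
        // touches one line of keys plus one line of values: two fetches.
        std::printf("keys per line (keys alone):            %zu\n", line / sizeof(int));
        return 0;
    }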

    That brings up the most important part of cache optimisation: measuring, and knowing your data. It's pretty much impossible to tell at a glance exactly how a given cache optimisation will perform without knowing the characteristics of the data you're accessing. If the array is very short, then storing the keys and values together might actually improve cache performance by avoiding a miss on the values. If the list_elements in the first example happened to all be stored in a single array (so accessing one would likely pull in a cache line containing another), then the cache performance wouldn't be as bad as we assumed when we pictured them spread out across memory. In fact, being able to add and remove elements without having to shuffle the rest of the list (the usual advantage of linked lists) might give some other algorithm that uses the list a speedup that dwarfs the savings from our cache optimisations.

    Basically, I can't give you a rule of thumb to follow, because there isn't one. Think about your data, think about the access patterns and what typical values are going to look like, and then test your optimisations with a tool like Google Benchmark on realistic values to check that you're actually getting a worthwhile speedup. Typically you'll see more benefit the larger the amount of data you're operating on, but again, there are always exceptions.
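
    As a starting point, here's a minimal sketch of what such a measurement might look like with Google Benchmark, reusing the contiguous-array search from above (the never-found key and the size range are arbitrary choices for illustration):

    #include <benchmark/benchmark.h>
    #include <vector>

    struct list_element
    {
        int key;
        int value;
    };

    // Same contiguous-array search as in the example above.
    static int search(int key, const list_element* array, int size)
    {
        while (size-- != 0)
        {
            if (array->key == key) return array->value;
            ++array;
        }
        return -1;
    }

    static void BM_array_search(benchmark::State& state)
    {
        const int size = static_cast<int>(state.range(0));
        std::vector<list_element> data(size);
        for (int i = 0; i < size; i++) data[i] = { i, i * 2 };

        for (auto _ : state)
        {
            // Search for a key that's never present, so the whole array is scanned.
            int result = search(-1, data.data(), size);
            benchmark::DoNotOptimize(result);
        }
    }
    // Sweep from "fits comfortably in L1" up to "much larger than L3".
    BENCHMARK(BM_array_search)->Range(1 << 10, 1 << 22);

    BENCHMARK_MAIN();

    Writing the same kind of harness for the linked-list version and the split key/value arrays, and running them on your real data sizes, is what actually answers the "how important is it" question for your case.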