go

Is it possible to use CPU cache in Golang?


Consider some memory and CPU intensive task:
e.g., a Task Block: read 16 bytes from memory, do some CPU work on them, then write the result back to memory.
These Task Blocks are parallelizable, meaning each core can run one Task Block.
e.g., 8 CPUs would need 8*16 bytes of cache, all in use concurrently.


Solution

  • Yes. Just like all other code running on your machine, Go programs use the CPU cache.

    The question is much too broad for me to tell you how to code your app to make the most efficient use of cache. I highly recommend setting up Go benchmarks, then refactoring your code and comparing times. (Note: avoid benchmarking inside a VM. VMs of any kind, on any platform, tend not to have clocks accurate enough for Go's benchmarking. Run all benchmarks natively on your OS instead.)
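    As a rough sketch of what such a comparison can look like, the program below times two ways of summing the same matrix using testing.Benchmark from the standard library, which runs a benchmark without `go test`. The function names (sumRowMajor, sumColMajor, newMatrix) and the 1024x1024 size are made up for illustration; the numbers you get depend entirely on your hardware.

```go
package main

import (
	"fmt"
	"testing"
)

// newMatrix builds an n x n matrix where element (i, j) holds i+j.
func newMatrix(n int) [][]int64 {
	m := make([][]int64, n)
	for i := range m {
		m[i] = make([]int64, n)
		for j := range m[i] {
			m[i][j] = int64(i + j)
		}
	}
	return m
}

// sumRowMajor visits elements in the order each row is laid out in memory.
func sumRowMajor(m [][]int64) int64 {
	var total int64
	for i := range m {
		for j := range m[i] {
			total += m[i][j]
		}
	}
	return total
}

// sumColMajor visits the same elements column by column, jumping between
// rows on every step, which tends to be less cache friendly.
func sumColMajor(m [][]int64) int64 {
	var total int64
	if len(m) == 0 {
		return 0
	}
	for j := range m[0] {
		for i := range m {
			total += m[i][j]
		}
	}
	return total
}

// sink keeps the compiler from discarding the benchmarked work.
var sink int64

func main() {
	m := newMatrix(1024) // hypothetical size, chosen to exceed typical L1/L2 capacity

	row := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = sumRowMajor(m)
		}
	})
	col := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = sumColMajor(m)
		}
	})

	fmt.Println("row-major:", row)
	fmt.Println("col-major:", col)
}
```

    In a real project you would more likely put these in a _test.go file as BenchmarkXxx functions and run `go test -bench=.`, but the idea is the same: change one thing, measure, compare.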

    It all comes down to how well you code the application to make efficient use of that CPU cache. That is a much broader topic: how you use your variables, how often they get updated, what stays on the stack versus what escapes to the heap and has to be garbage collected, how often that happens, etc.
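    To make the stack-versus-heap point a little more concrete, here is a small sketch that counts heap allocations with testing.AllocsPerRun. The block type and the functions newOnHeap and sumOnStack are hypothetical names for illustration; the 16-byte size just mirrors the Task Block in the question.

```go
package main

import (
	"fmt"
	"testing"
)

// block is a hypothetical 16-byte unit of work.
type block struct{ data [16]byte }

// Package-level sinks so the compiler cannot optimize the calls away.
var (
	sinkPtr *block
	sinkSum int
)

// newOnHeap returns a pointer to a local variable, so the value escapes
// the function and has to be allocated on the heap.
//
//go:noinline
func newOnHeap() *block {
	b := block{}
	return &b
}

// sumOnStack only uses its local block inside the function, so the
// compiler is free to keep it on the stack.
//
//go:noinline
func sumOnStack() int {
	var b block
	for i := range b.data {
		b.data[i] = byte(i)
	}
	s := 0
	for _, v := range b.data {
		s += int(v)
	}
	return s
}

// heapAllocs reports the average number of heap allocations per call.
func heapAllocs() float64 {
	return testing.AllocsPerRun(100, func() { sinkPtr = newOnHeap() })
}

// stackAllocs does the same for the stack-only variant.
func stackAllocs() float64 {
	return testing.AllocsPerRun(100, func() { sinkSum = sumOnStack() })
}

func main() {
	fmt.Println("allocs per call, pointer returned:", heapAllocs())
	fmt.Println("allocs per call, value kept local:", stackAllocs())
}
```

    If you want to see the compiler's own reasoning, building with `go build -gcflags=-m` prints its escape analysis decisions.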

    One tiny example to point you in the right direction to read more about efficient L1 and L2 cache development...

    Caches work in fixed-size rows called cache lines, which are 64 bytes (not 64 bits) on most current x86-64 CPUs. If you store 4x 16-bit int16s next to each other (as struct fields or array elements, say), they take 8 bytes in total and will almost certainly sit on the same cache line.

    Say one core wants to update one of the int16s. Cache coherence works on whole lines, not parts of them: the writing core has to take ownership of the entire line, which invalidates every other core's copy of it. Those cores then have to fetch the line again, even if they only care about the other 3 int16s. This effect is usually called false sharing.

    That is very inefficient if different cores are reading and writing the 4 int16s independently, because one small update costs everyone the whole line. One solution is to space the values out, for example by padding each one to its own cache line, so that a write to one value leaves the lines holding the other 3 in cache for quick reads. The flip side of the coin: what if a single goroutine refreshes all 4 int16s at the same time, constantly? Then spreading them over four cache lines means four lines to load and store, whereas keeping the 4x int16 packed on a single line is far more efficient.
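    A minimal sketch of the "give each value its own cache line" idea, applied to the 8-worker case from the question. It assumes a 64-byte cache line, which is typical of current x86-64 CPUs but is an assumption, not something Go guarantees; paddedCounter, runPadded and cacheLineSize are illustrative names.

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// cacheLineSize is an assumption: 64 bytes is common, but check your CPU.
const cacheLineSize = 64

// paddedCounter pads its 8-byte value out to a full assumed cache line,
// so the n fields of neighbouring slice elements are 64 bytes apart and
// cannot land on the same 64-byte line.
type paddedCounter struct {
	n int64
	_ [cacheLineSize - 8]byte
}

// runPadded starts one goroutine per worker; each goroutine updates only
// its own counter, so there is no data race and no shared cache line.
func runPadded(workers, iters int) []int64 {
	counters := make([]paddedCounter, workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := 0; i < iters; i++ {
				counters[w].n++
			}
		}(w)
	}
	wg.Wait()

	out := make([]int64, workers)
	for i := range counters {
		out[i] = counters[i].n
	}
	return out
}

// paddedSize reports how many bytes one padded counter occupies.
func paddedSize() uintptr {
	return unsafe.Sizeof(paddedCounter{})
}

func main() {
	fmt.Println("bytes per padded counter:", paddedSize())
	fmt.Println("per-worker totals:", runPadded(8, 1000000))
}
```

    To see whether the padding actually helps on your machine, benchmark this against a plain []int64 with the same workers. The golang.org/x/sys/cpu package also provides a CacheLinePad type if you would rather not hard-code the size.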

    Again, it highly depends on your use case: this may even slow things down if access to those 4 ints involves a lot of contention and context switching (e.g. goroutines blocking on mutex locks). In that case you have a whole different problem to optimize.

    I recommend reading up on high frequency scaling and on memory allocation on the stack vs. the heap. As a follow-up, I recommend reading up on memory alignment of structs in Go.
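    As a small taste of the struct alignment topic, the sketch below compares two structs with the same fields in a different order. The type names loose and tight are made up for illustration. On a typical 64-bit platform the first should come out at 24 bytes and the second at 16, because the int64 has to start on an 8-byte boundary and the compiler inserts padding to get it there.

```go
package main

import (
	"fmt"
	"unsafe"
)

// loose puts a 1-byte field before and after an 8-byte field, so the
// compiler has to insert padding around the int64.
type loose struct {
	a bool
	b int64
	c bool
}

// tight holds the same fields, largest first, so the two bools can share
// the space after the int64.
type tight struct {
	b int64
	a bool
	c bool
}

func looseSize() uintptr { return unsafe.Sizeof(loose{}) }
func tightSize() uintptr { return unsafe.Sizeof(tight{}) }

func main() {
	var l loose
	var t tight
	fmt.Println("loose size:", looseSize(), "offset of b:", unsafe.Offsetof(l.b))
	fmt.Println("tight size:", tightSize(), "offset of b:", unsafe.Offsetof(t.b))
}
```

    Smaller structs mean more of them fit in each cache line, which matters most when you iterate over large slices of them.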