Tags: go, caching, false-sharing

When should we use `CacheLinePad` to avoid false sharing?


It's well known that padding a struct so that it occupies one or more cache lines exclusively can be good for performance.

But in what situations should we add padding like the following to improve performance?
Are there any rules of thumb here?

import "golang.org/x/sys/cpu"

var S struct {
    _                   cpu.CacheLinePad
    A                   string
    _                   cpu.CacheLinePad
}

Solution

  • I've never really liked the term "false sharing". I think it would be better called "inappropriate sharing" or "oversharing". 😀

    But in what situations should we add padding like the following to improve performance? Are there any rules of thumb here?

    The rule is: measure first (benchmark). Then, if a lot of time is being spent somewhere, figure out why.

    "False sharing" causes performance problems if and when the underlying software and hardware you're using insists on moving data around in slow ways, even though there are faster ways available. By contorting your own code, you can convince the software and/or hardware to use the faster ways.

    Doing so often makes your own code less readable, or makes it take more space, or has some other similar drawback. Be sure that the cost of this drawback is exceeded by the value of the increased speed. If your software runs at the same speed or slower, the cost of damaging your code for speed was not paid for, so don't have done it.¹

    The usual case of "false sharing"—which is why I dislike the term—occurs when some data in a data structure could be shared well, used by multiple CPUs in multiple caches, except that a write (store operation) to some particular data item causes one CPU to invalidate all the other CPUs' cached copies, so that all the other CPUs must go back to main memory or re-copy the data from the writing CPU. The "insert padding" trick you describe helps if and when the writing CPU no longer affects the other CPUs' use of adjacent data items because those items, although adjacent in logical terms (e.g., in successive elements of an array or slice), no longer occupy a single cache line that becomes invalidated by the write.

    Suppose, for instance, that we have a data structure in which there are three (or perhaps seven) eight-byte fields that every CPU in a many-CPU machine will read, and a final eight-byte field that one of those CPUs (but only one) might update. Suppose further that the cache line size on this machine is 32 (or perhaps 64) bytes, and that the CPUs themselves use something like the MESI or MOESI cache model. In this case, the one CPU that writes to the one eight-byte field immediately invalidates any shared copies that exist in all the other CPUs' caches.

    If, however, that particular eight-byte field, that will be written by one CPU, is in its own cache line, or at least not in the shared cache line—e.g., is in a separate array—then the writing CPU does not invalidate any shared copies; these stay in the S (shared) state in all the CPUs.

    If a compiler can move the read-only and read/write fields of some data structure(s) around, so that the shareable parts that will benefit, time-wise, from being shared, stay shareable, you will not need to tweak your own code. Go, like C and C++, puts some constraints on compilers that may prevent them from doing their own optimizations here, which means you might have to do it yourself.

    But always measure first!


    ¹This is similar to the rule for making money in the stock market: buy a stock if it's going to go up. If it did not go up, don't have bought it. But at least the computer version is actually achievable, since you can run both your original version, and your price-paid contorted version, and see if the price you paid was worth the gain.