c++ · multithreading · atomic · folly

How does Facebook's folly::AccessSpreader work?


Here is the code for AccessSpreader from the Facebook Folly library: https://github.com/facebook/folly/blob/master/folly/concurrency/CacheLocality.h#L212

/// AccessSpreader arranges access to a striped data structure in such a
/// way that concurrently executing threads are likely to be accessing
/// different stripes.  It does NOT guarantee uncontended access.
/// Your underlying algorithm must be thread-safe without spreading, this
/// is merely an optimization.  AccessSpreader::current(n) is typically
/// much faster than a cache miss (12 nanos on my dev box, tested fast
/// in both 2.6 and 3.2 kernels).
///
/// If available (and not using the deterministic testing implementation)
/// AccessSpreader uses the getcpu system call via VDSO and the
/// precise locality information retrieved from sysfs by CacheLocality.
/// This provides optimal anti-sharing at a fraction of the cost of a
/// cache miss.
///
/// When there are not as many stripes as processors, we try to optimally
/// place the cache sharing boundaries.  This means that if you have 2
/// stripes and run on a dual-socket system, your 2 stripes will each get
/// all of the cores from a single socket.  If you have 16 stripes on a
/// 16 core system plus hyperthreading (32 cpus), each core will get its
/// own stripe and there will be no cache sharing at all.
///
/// AccessSpreader has a fallback mechanism for when __vdso_getcpu can't be
/// loaded, or for use during deterministic testing.  Using sched_getcpu
/// or the getcpu syscall would negate the performance advantages of
/// access spreading, so we use a thread-local value and a shared atomic
/// counter to spread access out.  On systems lacking both a fast getcpu()
/// and TLS, we hash the thread id to spread accesses.
///
/// AccessSpreader is templated on the template type that is used
/// to implement atomics, as a way to instantiate the underlying
/// heuristics differently for production use and deterministic unit
/// testing.  See DeterministicScheduler for more.  If you aren't using
/// DeterministicScheduler, you can just use the default template parameter
/// all of the time.
template <template <typename> class Atom = std::atomic>
struct AccessSpreader {
  /// Returns the stripe associated with the current CPU.  The returned
  /// value will be < numStripes.
  static size_t current(size_t numStripes) {
    // widthAndCpuToStripe[0] will actually work okay (all zeros), but
    // something's wrong with the caller
    assert(numStripes > 0);

    unsigned cpu;
    getcpuFunc(&cpu, nullptr, nullptr);
    return widthAndCpuToStripe[std::min(size_t(kMaxCpus), numStripes)]
                              [cpu % kMaxCpus];
  }

 private:
  /// If there are more cpus than this nothing will crash, but there
  /// might be unnecessary sharing
  enum { kMaxCpus = 128 };

  typedef uint8_t CompactStripe;

  static_assert(
      (kMaxCpus & (kMaxCpus - 1)) == 0,
      "kMaxCpus should be a power of two so modulo is fast");
  static_assert(
      kMaxCpus - 1 <= std::numeric_limits<CompactStripe>::max(),
      "stripeByCpu element type isn't wide enough");

  /// Points to the getcpu-like function we are using to obtain the
  /// current cpu.  It should not be assumed that the returned cpu value
  /// is in range.  We use a static for this so that we can prearrange a
  /// valid value in the pre-constructed state and avoid the need for a
  /// conditional on every subsequent invocation (not normally a big win,
  /// but 20% on some inner loops here).
  static Getcpu::Func getcpuFunc;

  /// For each level of splitting up to kMaxCpus, maps the cpu (mod
  /// kMaxCpus) to the stripe.  Rather than performing any inequalities
  /// or modulo on the actual number of cpus, we just fill in the entire
  /// array.
  static CompactStripe widthAndCpuToStripe[kMaxCpus + 1][kMaxCpus];

  static bool initialized;

  /// Returns the best getcpu implementation for Atom
  static Getcpu::Func pickGetcpuFunc() {
    auto best = Getcpu::resolveVdsoFunc();
    return best ? best : &FallbackGetcpuType::getcpu;
  }

  /// Always claims to be on CPU zero, node zero
  static int degenerateGetcpu(unsigned* cpu, unsigned* node, void*) {
    if (cpu != nullptr) {
      *cpu = 0;
    }
    if (node != nullptr) {
      *node = 0;
    }
    return 0;
  }

  // The function to call for fast lookup of getcpu is a singleton, as
  // is the precomputed table of locality information.  AccessSpreader
  // is used in very tight loops, however (we're trying to race an L1
  // cache miss!), so the normal singleton mechanisms are noticeably
  // expensive.  Even a not-taken branch guarding access to getcpuFunc
  // slows AccessSpreader::current from 12 nanos to 14.  As a result, we
  // populate the static members with simple (but valid) values that can
  // be filled in by the linker, and then follow up with a normal static
  // initializer call that puts in the proper version.  This means that
  // when there are initialization order issues we will just observe a
  // zero stripe.  Once a sanitizer gets smart enough to detect this as
  // a race or undefined behavior, we can annotate it.

  static bool initialize() {
    getcpuFunc = pickGetcpuFunc();

    auto& cacheLocality = CacheLocality::system<Atom>();
    auto n = cacheLocality.numCpus;
    for (size_t width = 0; width <= kMaxCpus; ++width) {
      auto numStripes = std::max(size_t{1}, width);
      for (size_t cpu = 0; cpu < kMaxCpus && cpu < n; ++cpu) {
        auto index = cacheLocality.localityIndexByCpu[cpu];
        assert(index < n);
        // as index goes from 0..n, post-transform value goes from
        // 0..numStripes
        widthAndCpuToStripe[width][cpu] =
            CompactStripe((index * numStripes) / n);
        assert(widthAndCpuToStripe[width][cpu] < numStripes);
      }
      for (size_t cpu = n; cpu < kMaxCpus; ++cpu) {
        widthAndCpuToStripe[width][cpu] = widthAndCpuToStripe[width][cpu - n];
      }
    }
    return true;
  }
};

template <template <typename> class Atom>
Getcpu::Func AccessSpreader<Atom>::getcpuFunc =
    AccessSpreader<Atom>::degenerateGetcpu;

template <template <typename> class Atom>
typename AccessSpreader<Atom>::CompactStripe
    AccessSpreader<Atom>::widthAndCpuToStripe[kMaxCpus + 1][kMaxCpus] = {};

template <template <typename> class Atom>
bool AccessSpreader<Atom>::initialized = AccessSpreader<Atom>::initialize();

// Suppress this instantiation in other translation units. It is
// instantiated in CacheLocality.cpp
extern template struct AccessSpreader<std::atomic>;

As far as I understand, it's supposed to wrap some data in an atomic class so that, when the data is accessed by multiple threads, false sharing is reduced? Could someone who has worked with Folly elaborate a bit on how it might work? I've been looking at it for a while and I don't even see where they put the atomic member variable.


Solution

  • No, this class does not do what you think.

    The overall idea is that when you have a number of equivalent resources/data structures, and want different threads to access different instances to minimize contention and maximize data locality, you use AccessSpreader to suggest the best resource/data to use for the current core/thread.

    For an example, see https://github.com/facebook/folly/blob/master/folly/IndexedMemPool.h. This memory pool implementation maintains a number of free-object lists to reduce thread contention on allocation/deallocation. And here is how AccessSpreader is used:

    AtomicStruct<TaggedPtr,Atom>& localHead() {
      // Pick the free list recommended for the CPU this thread is
      // currently running on.
      auto stripe = AccessSpreader<Atom>::current(NumLocalLists);
      return local_[stripe].head;
    }
    

    That is, it returns the index of an element (in some array, vector, etc.) that is recommended for use by the current thread.
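
    To make this concrete, here is a minimal, self-contained sketch of the same pattern: a striped counter whose write path uses AccessSpreader to pick a stripe. Everything except AccessSpreader<>::current() is made up for illustration (StripedCounter, kStripes, the alignas(128) padding), so treat it as a sketch of the technique rather than Folly code:

    #include <folly/concurrency/CacheLocality.h>

    #include <array>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // Hypothetical striped counter: each stripe sits on its own cache
    // line(s), and AccessSpreader suggests which stripe the current
    // thread should touch.
    class StripedCounter {
     public:
      void add(int64_t delta) {
        // Ask AccessSpreader which of our stripes best fits the CPU
        // this thread is currently running on.
        auto stripe = folly::AccessSpreader<>::current(kStripes);
        stripes_[stripe].value.fetch_add(delta, std::memory_order_relaxed);
      }

      int64_t read() const {
        // Reads still visit every stripe; the win is on the write
        // path, where concurrent threads usually hit different lines.
        int64_t sum = 0;
        for (auto const& s : stripes_) {
          sum += s.value.load(std::memory_order_relaxed);
        }
        return sum;
      }

     private:
      static constexpr std::size_t kStripes = 8; // assumed stripe count

      struct alignas(128) Stripe { // 128-byte alignment is an assumption
        std::atomic<int64_t> value{0};
      };
      std::array<Stripe, kStripes> stripes_;
    };

    Note that each stripe is still a std::atomic, which matches the class comment above: the underlying algorithm must be thread-safe without spreading; AccessSpreader only makes contention less likely.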

    Update (in response to the comment): it is not always possible to assign different indices to different threads, e.g. when the number of possible indices (stripes) is less than the number of CPUs, and the comments explicitly state that "It does NOT guarantee uncontended access". The class can be used not only to minimize contention but also to maximize data locality; for example, you might want to share some instances of data between threads that share a common cache. So the recommended index is a function of two variables: the current CPU (obtained internally via getcpuFunc) and the number of stripes (passed as the numStripes parameter), which is why a 2D array is needed. The array is filled at program initialization time using system-specific information (via the class CacheLocality), so that the recommended index takes data locality into account.
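
    For intuition about how that table is filled, here is a small sketch of the arithmetic from initialize() above, run on an assumed machine: 8 CPUs whose locality indices place two sockets of 4 cores side by side, and 2 stripes (the localityIndexByCpu values are invented; on a real system CacheLocality derives them from sysfs):

    #include <cstddef>
    #include <cstdio>

    int main() {
      constexpr std::size_t n = 8; // numCpus (assumed)
      // Assumed locality indices: CPUs on the same socket get
      // adjacent values, which is what CacheLocality arranges.
      constexpr std::size_t localityIndexByCpu[n] = {0, 1, 2, 3, 4, 5, 6, 7};
      constexpr std::size_t numStripes = 2;

      for (std::size_t cpu = 0; cpu < n; ++cpu) {
        std::size_t index = localityIndexByCpu[cpu];
        // Same formula as initialize(): as index goes 0..n-1, the
        // stripe goes 0..numStripes-1, keeping neighbours together.
        std::size_t stripe = (index * numStripes) / n;
        std::printf("cpu %zu -> stripe %zu\n", cpu, stripe);
      }
      // Prints: cpus 0-3 (socket 0) -> stripe 0, cpus 4-7 (socket 1)
      // -> stripe 1, i.e. each socket owns one stripe, exactly the
      // "2 stripes on a dual-socket system" case the class comment
      // describes.
    }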

    As for std::atomic, it is used merely to have separate AccessSpreader instantiations for testing and for production use, as explained in the comment right before the class declaration. The class does not have (and does not need) any atomic member variables.
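
    For completeness, here is what the Atom parameter looks like at the two instantiation points. The production alias is just the default; the test alias uses Folly's instrumented atomic (folly::test::DeterministicAtomic, declared in folly/test/DeterministicSchedule.h in the versions I've seen; treat the exact header path and name as assumptions):

    #include <folly/concurrency/CacheLocality.h>
    #include <folly/test/DeterministicSchedule.h> // assumed header path

    // Production: the default template argument, i.e. std::atomic.
    using ProdSpreader = folly::AccessSpreader<>;

    // Deterministic testing: the same heuristics instantiated over an
    // instrumented atomic, so the test scheduler can control and
    // replay interleavings.
    using TestSpreader = folly::AccessSpreader<folly::test::DeterministicAtomic>;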