Tags: linux, x86, memory-barriers, rcu

rcu_read_lock and x86-64 memory ordering


On a preemptible SMP kernel, rcu_read_lock compiles to the following:

current->rcu_read_lock_nesting++;
barrier();

where barrier() is a compiler barrier: it compiles to no instructions at all, and only stops the compiler from reordering memory accesses across it.

So, according to Intel's x86-64 memory-ordering white paper:

Loads may be reordered with older stores to different locations

why is the implementation actually OK?

Consider the following situation:

rcu_read_lock();
read_non_atomic_stuff();
rcu_read_unlock();

What prevents read_non_atomic_stuff from "leaking" forward past rcu_read_lock, causing it to run concurrently with the reclamation code running in another thread?


Solution

  • For observers on other CPUs, nothing prevents this. You're right: StoreLoad reordering can make the store half of the ++ globally visible only after some of your loads.

    Thus we can conclude that current->rcu_read_lock_nesting is only ever observed by code running on this core, by code that has remotely triggered a memory barrier on this core by getting scheduled here, or via a dedicated mechanism that gets all cores to execute a barrier in an inter-processor interrupt (IPI) handler, similar to the membarrier() user-space system call.
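    A rough sketch of such an IPI-based mechanism, using the real kernel smp_call_function() API (the callback name do_smp_mb is made up for illustration; expedited grace periods and sys_membarrier() use this kind of trick):

    ```c
    /* Execute a full memory barrier on every other CPU, waiting for
       all of them to finish, then one on this CPU as well. */
    static void do_smp_mb(void *unused)
    {
            smp_mb();
    }

    smp_call_function(do_smp_mb, NULL, 1);  /* 1 = wait for completion */
    smp_mb();
    ```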


    If this core starts running another task, that task is guaranteed to see this task's operations in program order. (Because it's on the same core, and a core always sees its own operations in order.) Also, context switches might involve a full memory barrier so tasks can be resumed on another core without breaking single-threaded logic. (That would make it safe for any core to look at rcu_read_lock_nesting when this task / thread is not running anywhere.)

    Notice that the kernel starts one RCU task per core of your machine; e.g. ps output shows [rcuc/0], [rcuc/1], ..., [rcuc/7] on my quad-core-with-hyperthreading (4c8t) machine. Presumably they're an important part of this design that lets readers be wait-free with no barriers.

    I haven't looked into full details of RCU, but one of the "toy" examples in https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt is "classic RCU" that implements synchronize_rcu() as for_each_possible_cpu(cpu) run_on(cpu);, to get the reclaimer to execute on every core that might have done an RCU operation (i.e. every core). Once that's done, we know that a full memory barrier must have happened in there somewhere as part of the switching.
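    That toy implementation, sketched as code (run_on() is the document's hypothetical helper that migrates the caller to the given CPU; it can't return until that CPU does a context switch, and readers in this toy scheme disable preemption):

    ```c
    /* Toy "classic RCU" from whatisRCU.txt: once we've been scheduled
       on every CPU, every reader that started before this call must
       have finished, and a full barrier happened on each CPU. */
    void synchronize_rcu(void)
    {
            int cpu;

            for_each_possible_cpu(cpu)
                    run_on(cpu);
    }
    ```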

    So yes, RCU doesn't follow the classic method where you'd need a full memory barrier (including StoreLoad) to make the core wait until the first store was visible before doing any reads. Avoiding the overhead of a full memory barrier in the read path is one of the major attractions of RCU, besides the avoidance of contention.