I am studying 'Computer Organization and Design', RISC-V edition, by David A. Patterson. An Elaboration on page 254 contains the code below.
Here are the book's text and the related code:
While the code above implemented an atomic exchange, the following code would more efficiently acquire a lock at the location in register x20, where the value of 0 means the lock was free and 1 to mean lock was acquired:
       addi x12, x0, 1        // copy locked value
again: lr.d x10, (x20)        // load-reserved to read lock
       bne  x10, x0, again    // check if it is 0 yet
       sc.d x11, x12, (x20)   // attempt to store new value
       bne  x11, x0, again    // branch if store fails
The book derives this from its earlier atomic-exchange code, quoted here, by adding the lock check:
Since the load-reserved returns the initial value, and the store-conditional returns 0 only if it succeeds, the following sequence implements an atomic exchange on the memory location specified by the contents of x20:
again: lr.d x10, (x20)        // load-reserved
       sc.d x11, x23, (x20)   // store-conditional
       bne  x11, x0, again    // branch if store fails
       addi x23, x10, 0       // put loaded value in x23
1- The book says that the lock version, which adds addi x12, x0, 1 // copy locked value, is 'more efficient'. I don't see where the efficiency comes from.
2- I think this lock can't avoid 'spurious failures' of sc.d, given the cache-line-based hardware design. Am I right?
I think the authors mean it takes fewer instructions than do{}while(x20->exchange(1) != 0);, which is the obvious way to use their exchange function to take a spinlock. (Taking a spinlock is what their loop does. In C++ terms it's somewhat like do{}while(!lock.cas_weak(0, 1));, but the points they're making about asm efficiency are specific to LL/SC.)
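To make the comparison concrete, here is a minimal C++ sketch of both ways to take the lock. The names lock_word, lock_with_exchange, lock_with_cas and unlock are made up for illustration, and the acquire/release orderings are my assumption of what a real lock would want; the book's listing uses plain lr.d / sc.d with no ordering annotations.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for the lock word at the address in x20.
// 0 = free, 1 = held, as in the book's text.
std::atomic<long> lock_word{0};

// Spinlock via unconditional exchange: stores 1 on every attempt,
// even while another thread holds the lock.
void lock_with_exchange() {
    while (lock_word.exchange(1, std::memory_order_acquire) != 0) {
        // old value was non-zero: the lock was already held, so retry
    }
}

// Closer to the book's lock loop: store 1 only if the value read was 0.
// compare_exchange_weak is allowed to fail spuriously, so it has to
// sit in a retry loop, much like the bne after sc.d.
void lock_with_cas() {
    long expected = 0;
    while (!lock_word.compare_exchange_weak(expected, 1,
                                            std::memory_order_acquire,
                                            std::memory_order_relaxed)) {
        expected = 0;  // a failed CAS overwrites expected with the value it saw
    }
}

void unlock() {
    assert(lock_word.load(std::memory_order_relaxed) == 1);  // must be held
    lock_word.store(0, std::memory_order_release);
}
```

On an LL/SC machine a compiler would typically turn lock_with_cas into a loop shaped like the book's, while lock_with_exchange needs an extra compare and branch on the returned old value around the exchange loop.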
Possibly there is also a benefit to not storing at all when the load sees a non-zero value. (So this core doesn't take exclusive ownership of the cache line when it can't do anything useful with it until after another core has stored a 0 to it.) But I'm not sure whether lr.d alone would try to get exclusive ownership (send an RFO = read for ownership) in anticipation of an SC. At the least, it doesn't dirty the cache line if the compare fails, so the line doesn't have to be written back.
That may also reduce livelock problems with multiple threads waiting for the lock, all running this loop.
For some related discussion on having multiple threads spinning, and on read-only vs. atomic RMW accesses, see Locks around memory manipulation via inline assembly (about x86, which only has single-instruction CAS_strong, not LL/SC).
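The read-only-spin idea can be sketched in portable C++ as a "test and test-and-set" lock: spin on a plain load while the lock is held, and attempt the atomic RMW only after seeing 0. TtasLock and its members are hypothetical names for illustration, and whether this helps on a given RISC-V core depends on how that core implements lr.d, which I'm not sure about.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical test-and-test-and-set spinlock: 0 = free, 1 = held.
struct TtasLock {
    std::atomic<int> word{0};

    // One attempt: returns true if we took the lock.
    bool try_lock() {
        // Read-only check first, so a held lock costs no store attempt.
        if (word.load(std::memory_order_relaxed) != 0) return false;
        // The lock looked free: now try the atomic RMW.
        return word.exchange(1, std::memory_order_acquire) == 0;
    }

    void lock() {
        for (;;) {
            // Spin with plain loads while the lock is held; the cache line
            // can stay in a shared state in every waiting core.
            while (word.load(std::memory_order_relaxed) != 0) {
            }
            // Another waiter may have won the race since our load,
            // so check the old value returned by the exchange.
            if (word.exchange(1, std::memory_order_acquire) == 0) return;
        }
    }

    void unlock() {
        assert(word.load(std::memory_order_relaxed) == 1);  // must be held
        word.store(0, std::memory_order_release);
    }
};
```

The design choice is the same one the book's loop makes with bne x10, x0, again: waiters only read until the holder stores 0, so they don't fight over exclusive ownership of the line.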