multithreadingsynchronizationcpu-architecturemulticorecompare-and-swap

Why is CompareAndSwap instruction considered expensive?


Why is CompareAndSwap instruction considered expensive?

I read in a book:

"Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction."

Thanks!


Solution

  • "CAS isn't appreciably different than a normal store. Some of the misinformation regarding CAS probably arises from the original implementation of lock:cmpxchg (CAS) on Intel processors. The lock: prefix caused the LOCK# signal to be asserted, acquiring exclusive access to the bus. This didn't scale of course. Subsequent implementations of lock:cmpxchg leverage cache coherency protocol -- typically snoop-based MESI -- and don't assert LOCK#." - David Dice, Biased locking in HotSpot

    "Memory barriers are expensive, about as expensive as an atomic compareAndSet() instruction."

    This is quite true.
    E.g. on x86, a proper CAS on a multi-processor system has a lock prefix.
    The lock prefix results in a full memory barrier:

    "...locked operations serialize all outstanding load and store operations (that is, wait for them to complete)." ..."Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor." - Intel® 64 and IA-32 Architectures Software Developer’s Manual, Chapter 8.1.2.

    A memory barrier is in fact implemented as a dummy LOCK OR or LOCK AND in both the .NET and the JAVA JIT on x86/x64.
    On x86, CAS results in a full memory barrier.

    On PPC, it is different. An LL/SC pair - lwarx & stwcx - can be used to load the memory operand into a register, then either write it back if there was no other store to the target location, or retry the whole loop if there was. An LL/SC can be interrupted.
    It also does not mean an automatic full fence.
    Performance characteristics and behaviour can be very different on different architectures.
    But then again - a weak LL/SC is not CAS.