concurrency x86 atomic lock-free compare-and-swap

Is x86 CMPXCHG atomic, if so why does it need LOCK?

The Intel documentation says: "This instruction can be used with a LOCK prefix to allow the instruction to be executed atomically."

My questions are:

Can CMPXCHG operate on a memory address? From the documentation it seems not, but can anyone confirm that it only works with actual values in registers, not a memory address?
If CMPXCHG isn't atomic, and a high-level-language CAS has to be implemented with LOCK CMPXCHG (with the LOCK prefix), what's the purpose of introducing such an instruction at all?

(I am asking from a high-level language perspective. That is, if a lock-free algorithm has to be translated into LOCK CMPXCHG on x86, it is still prefixed with LOCK. That would mean lock-free algorithms are no better than ones using a carefully written synchronized lock / mutex, on x86 at least. It also seems to make the bare CMPXCHG instruction pointless, since I guess the main point of introducing it was to support such lock-free operations.)

Solution

It seems like part of what you're really asking is:

Why isn't the lock prefix implicit for cmpxchg with a memory operand, like it is for xchg (since 386)?

The simple answer (that others have given) is that Intel designed it this way. But this leads to the question:

Why did Intel do that? Is there a use-case for cmpxchg without lock?

On a single-CPU system, cmpxchg is atomic with respect to other threads, or any other code running on the same CPU core. (But not with respect to "system" observers like a memory-mapped I/O device, or a device doing DMA reads of normal memory, so lock cmpxchg was relevant even on uniprocessor CPU designs.)

Context switches can only happen on interrupts, and interrupts happen before or after an instruction, not in the middle. Any code running on the same CPU will see the cmpxchg as either fully executed or not at all.
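To make the "fully executed or not at all" point concrete, here is a rough C model of what a 32-bit cmpxchg with a memory destination does. The struct and function names are made up for illustration, and the C code itself is of course not atomic. The real instruction does the load, compare, and store as a single instruction, which is why an interrupt can't land between them. The model also shows that the destination can be memory: only the expected value (implicitly in EAX) and the new value have to be in registers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the register state a 32-bit cmpxchg reads and writes. */
struct cmpxchg_regs {
    uint32_t eax; /* accumulator: the value we expect to find in memory */
    bool zf;      /* zero flag: true if the exchange happened */
};

/*
 * Sketch of `cmpxchg [dest], src` semantics (illustrative only).
 * This C function is NOT atomic; on real hardware the whole body is one
 * instruction, so code on the same core sees it as done or not started.
 */
static void cmpxchg_model(uint32_t *dest, uint32_t src, struct cmpxchg_regs *regs)
{
    uint32_t observed = *dest;     /* read the memory operand */
    if (observed == regs->eax) {
        *dest = src;               /* match: store the new value */
        regs->zf = true;
    } else {
        regs->eax = observed;      /* mismatch: report what was there */
        regs->zf = false;
    }
}
```

On failure the caller gets the current memory value back in EAX, which is what lets a retry loop try again without a separate reload.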

For example, the Linux kernel is normally compiled with SMP support, so it uses lock cmpxchg for atomic CAS. But when booted on a single-processor system, it patches the lock prefix to a ds prefix everywhere that code was inlined, since plain cmpxchg runs much faster than lock cmpxchg. (The ds prefix has no effect except to take up the space. Linux uses a flat memory model, so even in 32-bit code using (%ebp) or (%esp) addressing modes, which default to the ss segment, the result is still the same as a plain cmpxchg.) For more info, see this LWN article about Linux's "SMP alternatives" system. It can even patch back to lock prefixes before hot-plugging a second CPU.
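From the high-level-language side, a CAS retry loop in C11 looks roughly like the sketch below. atomic_store_max is a hypothetical helper invented for this example, not something from the question or from the kernel. On x86, compilers such as GCC and Clang typically turn the compare-exchange into lock cmpxchg, because portable code has to assume other cores may be running.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical helper: atomically raise *target to at least `candidate`.
 * Returns the value seen just before the operation completed.
 */
static uint32_t atomic_store_max(_Atomic uint32_t *target, uint32_t candidate)
{
    uint32_t seen = atomic_load(target);
    /* On failure, the compare-exchange refreshes `seen` with the current
     * value, so the loop re-checks the condition and retries. */
    while (seen < candidate &&
           !atomic_compare_exchange_weak(target, &seen, candidate)) {
        /* retry */
    }
    return seen;
}
```

No thread ever blocks here: a failed CAS just means another thread made progress, which is the practical difference from taking a mutex even though the instruction carries a lock prefix.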

Read more about atomicity of single instructions on uniprocessor systems in this answer, and in @supercat's answer + comments on "Can num++ be atomic for int num". See my answer there for lots of details about how atomicity really works / is implemented for read-modify-write instructions like lock cmpxchg.

(This same reasoning also applies to cmpxchg8b / cmpxchg16b, and xadd, which are usually only used for synchronization / atomic ops, not to make single-threaded code run faster. Of course memory-destination instructions like add [mem], reg have obvious uses for non-shared data.)

Related:

Interrupting instruction in the middle of execution: only a few instructions like rep movsb and vpgatherdd are interruptible partway through, and they don't support lock. They also have a well-defined way to update architectural state to record their partial progress, unlike a few ISAs where microarchitectural progress can get saved in hidden locations and resumed after an interrupt.
Interrupting an assembly instruction while it is operating: quotes Intel's manuals about that guarantee.
When an interrupt occurs, what happens to instructions in the pipeline?