It seems like part of what you're really asking is:
Why isn't the lock prefix implicit for cmpxchg with a memory operand, like it is for xchg (since 386)?
The simple answer (that others have given) is that Intel designed it this way. But this leads to the question:
Why did Intel do that? Is there a use-case for cmpxchg without lock?
On a single-CPU system, cmpxchg is atomic with respect to other threads, or any other code running on the same CPU core. (But not with respect to "system" observers like a memory-mapped I/O device, or a device doing DMA reads of normal memory, so lock cmpxchg was relevant even on uniprocessor CPU designs.)
Context switches can only happen on interrupts, and interrupts happen before or after an instruction, not in the middle. Any code running on the same CPU will see the cmpxchg as either fully executed or not at all.
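To make "fully executed or not at all" concrete, here is a rough C model of the steps that a single cmpxchg with a memory destination performs. The function name and parameters are made up for illustration. The C function itself is of course not atomic; the point is that on a uniprocessor an interrupt can arrive before or after the whole sequence, never between the load and the store.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of `cmpxchg [mem], src` (32-bit operand size).
 * *accumulator stands in for EAX, the return value stands in for ZF.
 * This is only a description of the steps, NOT an atomic operation. */
bool cmpxchg_model(uint32_t *mem, uint32_t *accumulator, uint32_t src)
{
    uint32_t old = *mem;        /* load the destination operand */
    if (old == *accumulator) {
        *mem = src;             /* match: store the new value, ZF = 1 */
        return true;
    }
    *accumulator = old;         /* mismatch: EAX gets the current value, ZF = 0 */
    return false;
}
```

A caller that gets false back finds the current memory value in the accumulator, which is what makes CAS retry loops convenient to write.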
For example, the Linux kernel is normally compiled with SMP support, so it uses lock cmpxchg for atomic CAS. But when booted on a single-processor system, it will patch the lock prefix to a ds prefix everywhere that code was inlined, since plain cmpxchg without the lock runs much faster than lock cmpxchg. (The ds prefix has no effect except to take up the space; Linux uses a flat memory model, so even in 32-bit code using (%ebp) or (%esp) addressing modes, it's still the same as a plain cmpxchg.) For more info, see this LWN article about Linux's "SMP alternatives" system. It can even patch back to lock prefixes before hot-plugging a second CPU.
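As a toy illustration of the patching idea (not the kernel's actual code), the sketch below rewrites prefix bytes in a buffer of machine code. It assumes the offsets of the prefixes were recorded in a table ahead of time; `patch_prefixes` and its parameters are hypothetical names. The byte values are the real x86 encodings: 0xF0 for lock and 0x3E for the ds segment-override prefix.

```c
#include <stddef.h>
#include <stdint.h>

enum { PREFIX_LOCK = 0xF0, PREFIX_DS = 0x3E };

/* Toy model: at each recorded offset, replace prefix byte `from` with `to`.
 * Offsets that are out of range or don't hold `from` are left alone.
 * Returns the number of bytes actually patched. */
size_t patch_prefixes(uint8_t *code, size_t code_len,
                      const size_t *sites, size_t n_sites,
                      uint8_t from, uint8_t to)
{
    size_t patched = 0;
    for (size_t i = 0; i < n_sites; i++) {
        size_t off = sites[i];
        if (off < code_len && code[off] == from) {
            code[off] = to;     /* same length, so nothing else moves */
            patched++;
        }
    }
    return patched;
}
```

For instance, the bytes F0 0F B1 0B encode lock cmpxchg %ecx,(%ebx); patching the first byte to 3E leaves a plain cmpxchg of the same length, and calling the function again with `from` and `to` swapped restores the lock prefix.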
Read more about atomicity of single instructions on uniprocessor systems in this answer, and in @supercat's answer + comments on Can num++ be atomic for int num. See my answer there for lots of details about how atomicity really works / is implemented for read-modify-write instructions like lock cmpxchg.
(This same reasoning also applies to cmpxchg8b / cmpxchg16b, and xadd, which are usually only used for synchronization / atomic ops, not to make single-threaded code run faster. Of course memory-destination instructions like add [mem], reg have obvious uses for non-shared data.)
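For a sense of what the synchronization use looks like from C, here is a minimal CAS retry loop using C11 `<stdatomic.h>`. The function name `atomic_fetch_max_sketch` is made up for this example. On x86, GCC and Clang typically compile the compare-exchange to lock cmpxchg, but that is a property of the compiler and target, not something the C standard promises.

```c
#include <stdatomic.h>

/* Atomically raise *target to at least `candidate`.
 * Returns the value observed before any update. */
int atomic_fetch_max_sketch(atomic_int *target, int candidate)
{
    int seen = atomic_load(target);
    while (seen < candidate &&
           !atomic_compare_exchange_weak(target, &seen, candidate)) {
        /* CAS failed: `seen` now holds the current value, so just retry. */
    }
    return seen;
}
```

The loop only retries when another thread changed the value between the load and the CAS (or on a spurious failure of the weak form), so in the uncontended case it is a single compare-exchange.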
... lock if it didn't itself exist? – Agustinaah
... LOCK. The high-level locks that lock-free algorithms try to avoid will have to put threads into wait state until the lock is available which is a costly operation and an entirely different thing than the CPU LOCK prefix feature which might hold other threads for a single instruction only. – Emrick