A long time ago, before the Intel 80486, Intel processors didn't have on-chip caches or write buffers. Therefore, by design, all writes became globally visible immediately and in order, and there were no buffered stores to drain. A locked transaction was executed by locking the bus for the entire address space.
In the 486 and Pentium processors, write buffers were added on-chip, and some models have on-chip caches as well. Consider first the models that don't have on-chip caches. All writes are temporarily held in on-chip write buffers until they are written to the bus when it becomes available or until a serializing event occurs. Remember that atomic RMW transactions are used to acquire exclusive access to software structures or hardware resources. So if a processor performs a locked transaction, it must never happen that the processor thinks it has been granted ownership of the resource while another processor somehow obtains ownership as well. If the write part of the locked transaction were buffered in a write buffer and the bus lock were then relinquished, nothing would prevent other agents from acquiring access to the resource at the same time. Essentially, the write part has to be made visible to all other agents before the lock is released, and the way to do this is by not buffering it. But the x86 memory model requires that all writes become globally visible in order (there was no weak ordering on these processors). So in order to make the write part of a locked transaction globally observable, all earlier buffered writes also had to be made globally observable first, in the same order.
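To make the ownership argument concrete, here is a minimal sketch of a test-and-set spinlock built on an atomic RMW. The names `SpinLock` and `run_counter` are hypothetical and exist only for this illustration. It assumes a C++11 compiler; on x86, compilers typically implement `exchange` with `XCHG`, which is implicitly locked. If the write half of that RMW could sit in a buffer after the lock was released, two threads could both see `false` and both enter the critical section.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical test-and-set spinlock; names are illustrative only.
struct SpinLock {
    std::atomic<bool> held{false};

    void lock() {
        // exchange() is an atomic read-modify-write. It returns the previous
        // value, so a result of false means this thread performed the
        // false -> true transition and now owns the lock.
        while (held.exchange(true, std::memory_order_acquire)) {
            // Spin until the current owner releases the lock.
        }
    }

    void unlock() {
        held.store(false, std::memory_order_release);
    }
};

// Runs `threads` workers that each increment a plain counter `iters` times
// while holding the lock, and returns the final count.
long run_counter(int threads, int iters) {
    SpinLock lock;
    long counter = 0;  // deliberately non-atomic: protected only by the lock
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling `run_counter(4, 10000)` should return 40000, because the atomic exchange lets only one thread at a time observe the unlocked state.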
Some 486 models and all Pentium processors have on-chip caches, but these processors had no support for cache locks. That's why locked transactions were not cacheable on them: the only way to guarantee atomicity was to bypass the cache and lock the bus. After acquiring the bus lock, one or more writes are performed depending on the alignment and size of the destination memory region. The write buffers still have to be drained before releasing the bus lock.
The Pentium Pro introduced some major changes, including weakly-ordered writes, write-combining buffers, and cache locking. What had been called "write buffers" is what is usually referred to as store buffers on more modern microarchitectures. A locked transaction uses cache locking on these processors, but the cache lock cannot be released until the locked store has been committed from the store buffer to the cache, which makes the store globally observable, and that in turn requires making all earlier stores globally observable first. These events have to happen in that order. That said, I don't think locked transactions have to serialize weakly-ordered writes, but Intel decided to make them behave this way, maybe because Intel wanted a convenient instruction that drains WC buffers on the PPro in the absence of a dedicated store fence.
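The visible effect of draining the store buffer can be sketched with the classic store-buffering litmus test. This is an illustration under stated assumptions, not a description of the hardware: the function name `count_both_zero` is hypothetical, and the mapping to instructions is only the common one (for a sequentially consistent store, x86 compilers usually emit `XCHG`, or `MOV` followed by `MFENCE`). Because each store must become globally visible before the following load executes, the outcome where both threads read 0 is ruled out.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Runs the store-buffering litmus test `rounds` times and returns how many
// rounds ended with both threads reading 0 from the other thread's variable.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;
        std::thread a([&] {
            // Sequentially consistent store: cannot linger in the store
            // buffer past the load that follows it.
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread b([&] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        a.join();
        b.join();
        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With `memory_order_seq_cst` the function should always return 0. Weakening the stores to `memory_order_release` and the loads to `memory_order_acquire` removes that guarantee, since a plain x86 store may still be in the store buffer when the later load executes.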
[...] LOCK implementations. LOCK has not been a global lock since the 486 (I believe). What the processor does is like (a) -- which maps to seq_cst given the general 'strength' of the x86 memory model. – Anorthite
[...] sync, only lwsync before (release) and isync after (acquire), but two such RMWs back to back might be enough to make it impossible or at least implausible on real hardware. – Elector
[...] sync before the RMW retry loop. (SC pure-load costs a sync on POWER). In any case, unless the store bypasses the SB entirely (which would actually make sense for an atomic store-conditional; I'm retracting my earlier "pretty sure"), it will be in the SB and probably graduate for at least a cycle before it actually commits, so there could be a window of opportunity for cross-SMT store forwarding before it becomes globally visible. – Elector
[...] lock semantics. But on x86, locked RMWs are back-to-back in this core's operations (across all cache lines), unlike on LL/SC machines: For purposes of ordering, is atomic read-modify-write one operation or two? – Elector