Why is LOCK a full barrier on x86?

Why does the LOCK prefix cause a full barrier on x86 (and thus drain the store buffer and provide sequential consistency)?

For LOCK/read-modify-write operations, a full barrier shouldn't be required: exclusive access to the cache line seems sufficient. Is it a design choice, or is there some other limitation?
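
To make the question concrete, here's a minimal sketch (assuming a mainstream x86-64 compiler such as GCC or Clang): even a relaxed RMW pays for the full barrier, because the barrier is a property of the LOCK prefix itself.

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Both orderings typically compile to the same LOCKed instruction on x86-64
// (e.g. `lock add dword ptr [rip + counter], 1`): the LOCK prefix already
// provides the full barrier this question asks about.
void add_seq_cst() { counter.fetch_add(1, std::memory_order_seq_cst); }
void add_relaxed() { counter.fetch_add(1, std::memory_order_relaxed); }
```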

Left asked 21/2, 2020 at 5:16 Comment(15)
Short answer: it's both a load and a store (which have to stay atomically together in the global order of operations), so it can't reorder with loads or stores in either direction. So it ends up having to be a full barrier.Elector
@PeterCordes I thought about that, however it is a load-then-store, and the x86 memory model already prohibits LoadStore reordering. Isn't that sufficient?Left
Yes, but consider some examples, e.g. an RMW then a load. Can the RMW be delayed and appear after the load, like a normal store can? No, because it would bring its load with it, and that would be LoadLoad reordering (see the litmus test below).Elector
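
A litmus-test sketch of that argument (illustrative names; default seq_cst ordering throughout):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2;

void thread1() {
    x.exchange(1);  // RMW: its load and store must stay adjacent in the global order
    r1 = y.load();
}
void thread2() {
    y.exchange(1);
    r2 = x.load();
}
// r1 == 0 && r2 == 0 would require each RMW's store to be delayed past the
// following load. But that would drag the RMW's own load along with it,
// i.e. LoadLoad reordering, which x86 never allows. So the outcome is
// forbidden, and the LOCKed RMW behaves as a full barrier.
```
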
@PeterCordes Hmm, I see, so in that case it would be to prevent another load from "sneaking" in between the RMW's load and store? (which would break its atomicity)Left
(Which also happens if a store coming before the RMW gets reordered with its load, making LOCK boundaries effectively a full barrier?)Left
Pretty much. AFAICT, the only difference between an acq_rel RMW and a seq_cst RMW in ISO C++ is that acq_rel doesn't forbid IRIW reordering (when the load part observes a pure store from another core), but x86's total store order never allows that (the IRIW pattern is sketched below, after these comments). Although see comments: How do memory_order_seq_cst and memory_order_acq_rel differ?Elector
RMWs on LL/SC architectures are trickier to think about. One attempt I made: What exact rules in the C++ memory model prevent reordering before acquire operations?. You can reorder as long as the final result is compatible with there being an atomic RMW somewhere in the modification order of the target cache line, and in any global order any other core could see. Planning to write a proper answer soon, but leaving comments while I think about it.Elector
I see, very interesting. Thanks for the useful explanations!Left
To support a "relaxed" read-modify-write, I think you are right, locking the cache line(s) would ensure that no other thread's write can become visible between the read and the write. But either (a) that lock must be held until the store buffer drains, or (b) the write would need to jump the buffer. I guess (b) adds some complexity, plus it would be incompatible with previous LOCK implementations. LOCK has not been a global lock since the 486 (I believe). What the processor does is like (a) -- which maps to seq_cst given the general 'strength' of the x86 memory model.Anorthite
@PeterCordes "the only difference between an acq_rel RMW and a seq_cst RMW in ISO C++ is that acq_rel doesn't forbid IRIW reordering"; considering e.g. POWER, if an acq_rel RMW has to guarantee StoreLoad order (otherwise LoadLoad reordering may result when two RMW operations are reordered), then it has to drain the store buffer. In that case IRIW is not possible; isn't that a contradiction with the fact that an acq_rel RMW doesn't forbid IRIW?Finkelstein
@DanielNitzan: ISO C++'s formal rules can be weaker than any real ISA in practice. I'm pretty sure POWER can still do IRIW between the outputs of two acq_rel RMWs if the observers are acquire loads (not what I described in my earlier comment), unless stwcx stores completely bypass the store buffer? I'm not sure about observing with two acq_rel exchanges or fetch_add(0)s in each reader thread, though. You could ask that as a separate SO question about POWER; it's too complex for these comments.Elector
@DanielNitzan: Note that godbolt.org/z/Pvxc99 shows that acq_rel fetch_add does not include a sync, only lwsync before (release) and isync after (acquire), but two such RMWs back to back might be enough to make it impossible or at least implausible on real hardware.Elector
@DanielNitzan: The store buffer is the mechanism on real HW as you say, but note that even the SC version only does sync before the RMW retry loop. (SC pure-load costs a sync on POWER). In any case, unless the store bypasses the SB entirely (which would actually make sense for an atomic store-conditional; I'm retracting my earlier "pretty sure"), it will be in the SB and probably graduate for at least a cycle before it actually commits, so there could be a window of opportunity for cross-SMT store forwarding before it becomes globally visible.Elector
@PeterCordes Thanks, my logic was flawed; I was trying to reason about the reordering of two back-to-back RMWs, which is forbidden but has nothing to do with a StoreLoad guarantee. Store-buffer delay and cross-SMT store forwarding can manifest as you've mentioned above.Finkelstein
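
For reference, the IRIW (Independent Reads of Independent Writes) shape discussed in these comments, sketched in C++ with plain acquire/release operations (the acq_rel-RMW variants debated above are a stronger flavor of the same pattern):

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

void writer1() { x.store(1, std::memory_order_release); }
void writer2() { y.store(1, std::memory_order_release); }

void reader1() {
    r1 = x.load(std::memory_order_acquire);
    r2 = y.load(std::memory_order_acquire);
}
void reader2() {
    r3 = y.load(std::memory_order_acquire);
    r4 = x.load(std::memory_order_acquire);
}
// The outcome r1 == 1, r2 == 0, r3 == 1, r4 == 0 means the two readers saw
// the two independent stores in opposite orders. ISO C++ allows this with
// acquire/release (and POWER can produce it in practice); making all the
// operations seq_cst forbids it, and x86's total store order never
// produces it.
```
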
Update: My first comment isn't totally correct. Operations on other cache lines by other threads can come between the load and store of an atomic RMW in the global order of operations, even with x86's strongly ordered lock semantics. But on x86, locked RMWs are back-to-back in this core's operations (across all cache lines), unlike on LL/SC machines: For purposes of ordering, is atomic read-modify-write one operation or two?Elector

A long time ago, before the Intel 80486, Intel processors didn't have on-chip caches or write buffers. Therefore, by design, all writes became immediately globally visible in order, and there were no buffered stores to drain from anywhere. A locked transaction was executed by locking the bus for the entire address space.

In the 486 and Pentium processors, write buffers were added on-chip, and some models had on-chip caches as well. Consider first the models without on-chip caches. All writes are temporarily held in the on-chip write buffers until they can be written on the bus or a serializing event occurs. Remember that atomic RMW transactions are used to acquire exclusive access to software structures or hardware resources (think of the spinlock sketched below). So if a processor performs a locked transaction, it must not happen that the processor thinks it was granted ownership of the resource while another processor somehow ends up obtaining ownership as well. If the write part of the locked transaction got buffered in a write buffer and the bus lock was then relinquished, nothing would prevent other agents from also acquiring access to the resource at the same time. Essentially, the write part has to be made visible to all other agents, and the way to do this is by not buffering it. But the x86 memory model requires that all writes become globally visible in order (there was no weak ordering on these processors). So to make the write part of a locked transaction globally observable, all buffered writes also had to be made globally observable, in the same order.
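
To illustrate the ownership argument, here is a minimal spinlock sketch (illustrative names, not from the original answer). If the store that sets the flag could linger invisibly in a write buffer after the bus lock was released, a second core's RMW could also read 0, and both cores would believe they own the lock.

```cpp
#include <atomic>

std::atomic<int> lock_flag{0};  // 0 = free, 1 = held (illustrative)

void acquire() {
    // Atomic RMW: reads the old value and writes 1 as one indivisible step.
    // Correctness depends on the write half becoming visible to all other
    // agents before the lock (bus or cache line) is released.
    while (lock_flag.exchange(1) != 0) { /* spin */ }
}

void release() {
    lock_flag.store(0, std::memory_order_release);
}
```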

Some 486 models and all Pentium processors have on-chip caches, but on these processors there was no support for cache locks. That's why locked transactions were not cacheable on these processors: the only way to guarantee atomicity was to bypass the cache and lock the bus. After acquiring the bus lock, one or more writes are performed depending on the alignment and size of the destination memory region. The write buffers still have to be drained before releasing the bus lock.

The Pentium Pro introduced some major changes, including weakly-ordered writes, write-combining buffers, and cache locking. What were called "write buffers" are usually referred to as store buffers on more modern microarchitectures. A locked transaction uses cache locking on these processors, but the cache lock cannot be released until the locked store commits from the store buffer to the cache. That commit makes the store globally observable, which in turn requires that all earlier stores become globally observable first. These events have to happen in that order. That said, I don't think locked transactions strictly have to serialize weakly-ordered writes, but Intel decided to make them do so, perhaps to provide a convenient instruction that drains the WC buffers on the PPro in the absence of a dedicated store fence.
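
That last point is also why a dummy LOCKed RMW doubles as a full barrier today. As a hedged sketch: some compilers have implemented a seq_cst fence with a LOCKed RMW on a dummy location (e.g. `lock or` on a stack slot), since it can be cheaper than mfence.

```cpp
#include <atomic>

std::atomic<int> dummy{0};

// A no-op LOCKed RMW acts as a full (StoreLoad) barrier on x86: it cannot
// complete until all earlier stores are globally observable, and later
// loads cannot pass it.
void full_barrier() {
    dummy.fetch_add(0, std::memory_order_seq_cst);  // e.g. `lock add dword ptr [dummy], 0`
}
```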

Loophole answered 21/2, 2020 at 18:26 Comment(0)
