Why is a store-load barrier considered expensive?

Most CPU architectures will re-order stores with respect to later loads, but my question is why? My interpretation of a store-load barrier would look like this:

x = 50;
store_load_barrier;
y = z;

Furthermore, I don't see how this barrier would have much use in lock-free programming in comparison to release and acquire semantics.
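
For concreteness, the snippet above could be written with C++11 atomics roughly as follows (a sketch under assumed types: x and z are taken to be std::atomic<int>, and std::atomic_thread_fence(std::memory_order_seq_cst) stands in for store_load_barrier, since release and acquire fences on their own do not order an earlier store before a later load):

#include <atomic>

std::atomic<int> x{0}, z{0};
int y;

void store_then_load() {
    x.store(50, std::memory_order_relaxed);               // x = 50;
    std::atomic_thread_fence(std::memory_order_seq_cst);  // store_load_barrier;
    y = z.load(std::memory_order_relaxed);                // y = z;
}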

Eliason answered 14/12, 2014 at 22:46 Comment(1)
Related: How does memory reordering help processors and compilers? Forcing the CPU not to reorder defeats those pipelining wins.Polemoniaceous

Short Answer: The store-load barrier prevents the processor from speculatively executing LOADs that come after the barrier until all previous stores have completed.

Details:

The reason that a store-load barrier is expensive is that it prevents the reordering of LOAD and STORE operations across the barrier.

Suppose you had an instruction sequence like the following:

...             ;; long latency operation to compute r1
ST r1, [ADDR1]  ;; store value in r1 to memory location referenced by ADDR1
LD r3, [ADDR2]  ;; load r3 with value in memory location ADDR2
...             ;; instructions that use result in r3

When this sequence executes, the value of r1 will be the result of an operation that takes a long time to complete. The instruction ST r1, [ADDR1] will have to stall until r1 is ready. In the meantime, an out-of-order processor can speculatively execute the LD r3, [ADDR2] and the other instructions, provided they are independent of the earlier store. They won't actually commit until the store commits, but by doing most of the work speculatively, the results can sit in the reorder buffer, ready to commit more quickly.

This works on a single-processor system because the CPU can check whether there are dependencies between ADDR1 and ADDR2. But in a multiprocessor system, multiple CPUs independently execute loads and stores. Several processors might be performing a ST to ADDR1 and a LD from ADDR2 at the same time. If the CPUs speculatively execute these apparently independent instructions, then different CPUs can see different results. I think the following blog post gives a good explanation of how this can happen (I don't think it's something I could summarize succinctly in this answer).
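
One concrete case is the classic "store buffer" litmus test, sketched below with C++11 relaxed atomics (the function names and variables are my own illustration, not taken from the blog post):

#include <atomic>
#include <cstdio>
#include <thread>

std::atomic<int> X{0}, Y{0};
int r1, r2;

void cpu1() {
    X.store(1, std::memory_order_relaxed);   // ST to ADDR1
    r1 = Y.load(std::memory_order_relaxed);  // LD from ADDR2
}

void cpu2() {
    Y.store(1, std::memory_order_relaxed);   // ST to ADDR2
    r2 = X.load(std::memory_order_relaxed);  // LD from ADDR1
}

int main() {
    std::thread t1(cpu1), t2(cpu2);
    t1.join();
    t2.join();
    // Each load may be satisfied before the other thread's store becomes
    // visible, so r1 == 0 && r2 == 0 is a permitted outcome even though no
    // interleaving of the four statements could produce it. A store-load
    // barrier between the store and the load in each thread rules it out.
    std::printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}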

Now consider the code sequence that has a store-load barrier:

...             ;; long latency operation to compute r1
ST r1, [ADDR1]  ;; store value in r1 to memory location referenced by ADDR1
ST_LD_BARRIER   ;; store-load barrier
LD r3, [ADDR2]  ;; load r3 with value in memory location ADDR2
...             ;; instructions that use result in r3

This would prevent the LD r3, [ADDR2] instruction and the dependent instructions that follow it from being speculatively executed until the previous store instructions complete. And this can reduce CPU performance, because the entire pipeline might have to stall while waiting for the ST instruction to complete, even though within the CPU itself there is no dependency between the LD and the ST.

There are some things that can be done to limit how long the CPU has to stall. But the bottom line is that the store-load barrier creates additional dependencies between loads and stores, and that limits the amount of speculative execution the CPU can perform.
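
As a rough illustration of where this cost shows up in practice (a sketch; the exact code generation depends on the compiler and target), a sequentially consistent store has to include the store-load barrier, while a release store does not:

#include <atomic>

std::atomic<int> g{0};

// Needs the store-load barrier, so on x86-64 this is typically emitted as
// 'xchg' (an implicitly locked read-modify-write) or 'mov' plus 'mfence'.
void seq_cst_store(int v) {
    g.store(v, std::memory_order_seq_cst);
}

// Needs no store-load barrier; on x86-64 this is typically a plain 'mov'.
void release_store(int v) {
    g.store(v, std::memory_order_release);
}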

Yasui answered 15/12, 2014 at 5:27 Comment(6)
I think you might have mentioned it in a similar context, but the load can also be 'reordered' if the store and load are from the same address. Am I correct in thinking the load might search the thread's store buffer and retrieve the value, rather than going to main memory, which would be more expensive?Eliason
@William, yes the store's value will be forwarded to the dependent load when possible. Here's a link that gives some more details.Yasui
Great answer. Regarding the comments - memory disambiguation is only needed when the addresses are not yet known, but even if they are - there's still a long window for forwarding in case the store misses the caches and goes to memory.Hod
@gabriel, I think there is a typo in the asm code fragments. "load r1 with" should read "load r3 with", in two places. Not editing it directly in case I missed something.Marthmartha
It would be more useful to compare it against LoadLoad, StoreStore and LoadStore barriers. They also stall the CPU from executing memory operations out of order, don't they? But somehow all the articles I have read so far have mentioned that the StoreLoad barrier is special and more expensive.Tony
LoadLoad, StoreStore and LoadStore barriers on x86 are all no-ops due to its strong memory-ordering guarantees. For more information see cs.cmu.edu/~410-f10/doc/Intel_Reordering_318147.pdfHimelman
