Short Answer: A store-load barrier prevents the processor from speculatively executing LOADs that come after the barrier until all previous stores have completed.
Details:
The reason that a store-load barrier is expensive is that it prevents the reordering of LOAD and STORE operations across the barrier.
Suppose you had an instruction sequence like the following:
... ;; long latency operation to compute r1
ST r1, [ADDR1] ;; store value in r1 to memory location referenced by ADDR1
LD r3, [ADDR2] ;; load r3 with value in memory location ADDR2
... ;; instructions that use result in r3
When this sequence executes, the value of r1 will be the result of an operation that takes a long time to complete. The instruction ST r1, [ADDR1] will have to stall until r1 is ready. In the meantime, an out-of-order processor can speculatively execute the LD r3, [ADDR2] and other instructions if they are independent of the earlier store. They won't actually commit until the store is committed, but by doing most of the work speculatively the results can be saved in the reorder buffer and be ready to commit more quickly.
This works for a single-processor system because the CPU can check whether there are dependencies between ADDR1 and ADDR2. But in a multiprocessor system, multiple CPUs can independently execute loads and stores. There might be multiple processors that are performing a ST to ADDR1 and a LD from ADDR2. If the CPUs are able to speculatively execute these instructions that don't appear to have dependencies, then different CPUs might see different results. I think the following blog post gives a good explanation of how this can happen (I don't think it's something I could summarize succinctly in this answer).
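To make the multiprocessor case a bit more concrete, here is a rough C++ sketch of the classic two-thread litmus test. It is only an illustration under assumptions of mine: x and y stand in for ADDR1 and ADDR2, the function name run_unfenced_litmus is hypothetical, and relaxed atomics are used so that neither the compiler nor the CPU is asked to keep the store ahead of the load.

```cpp
#include <array>
#include <atomic>
#include <thread>

// Hypothetical litmus test: x and y stand in for ADDR1 and ADDR2.
// Returns how often each (r1, r2) outcome was seen, indexed as r1 * 2 + r2.
std::array<int, 4> run_unfenced_litmus(int iterations) {
    std::array<int, 4> seen{};
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);   // ST to ADDR1
            r1 = y.load(std::memory_order_relaxed);  // LD from ADDR2
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);   // ST to ADDR2
            r2 = x.load(std::memory_order_relaxed);  // LD from ADDR1
        });
        a.join();
        b.join();

        ++seen[r1 * 2 + r2];  // r1 and r2 are each 0 or 1 here
    }
    return seen;
}
```

If every thread ran its store before its load and the operations were simply interleaved, at least one of r1 and r2 would have to be 1. A nonzero seen[0] (both loads returned 0) therefore means some load took effect ahead of the earlier store. Whether you actually observe it depends on the hardware and on timing; starting fresh threads on every iteration makes the overlap fairly rare, so a count of zero here does not prove the reordering cannot happen.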
Now consider the code sequence that has a store-load barrier:
... ;; long latency operation to compute r1
ST r1, [ADDR1] ;; store value in r1 to memory location referenced by ADDR1
ST_LD_BARRIER ;; store-load barrier
LD r3, [ADDR2] ;; load r3 with value in memory location ADDR2
... ;; instructions that use result in r3
This would prevent the LD r3, [ADDR2] instruction and the following dependent instructions from being speculatively executed until the previous store instructions complete. This can reduce CPU performance because the entire CPU pipeline might have to stall while waiting for the ST instruction to complete, even though within the CPU itself there is no dependency between the LD and the ST.
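As a hedged illustration of what the barrier buys you, here is a C++ sketch in which two threads each do a store, then a full fence, then a load of the other location. The names x, y and count_both_zero_with_fence are hypothetical, and I am using std::atomic_thread_fence(std::memory_order_seq_cst) as a stand-in for ST_LD_BARRIER; how that fence is lowered to machine instructions depends on the compiler and target.

```cpp
#include <atomic>
#include <thread>

// Hypothetical test: x and y stand in for ADDR1 and ADDR2.
// Counts how often both threads loaded 0 even though each stored 1 first.
int count_both_zero_with_fence(int iterations) {
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        std::atomic<int> x{0}, y{0};
        int r1 = -1, r2 = -1;

        std::thread a([&] {
            x.store(1, std::memory_order_relaxed);                // ST to ADDR1
            std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load barrier
            r1 = y.load(std::memory_order_relaxed);               // LD from ADDR2
        });
        std::thread b([&] {
            y.store(1, std::memory_order_relaxed);                // ST to ADDR2
            std::atomic_thread_fence(std::memory_order_seq_cst);  // store-load barrier
            r2 = x.load(std::memory_order_relaxed);               // LD from ADDR1
        });
        a.join();
        b.join();

        if (r1 == 0 && r2 == 0) ++both_zero;
    }
    return both_zero;
}
```

With a sequentially consistent fence in both threads, the C++ memory model rules out the outcome where both loads return 0, so the function should always return 0. The price is the one described above: each load has to wait for the preceding store to become visible instead of running ahead of it.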
There are some things that can be done to limit how long the CPU has to stall. But the bottom line is that the store-load barrier creates additional dependencies between loads and stores, and this limits the amount of speculative execution that the CPU can perform.