What does "store-buffer forwarding" mean in the Intel developer's manual?

The Intel 64 and IA-32 Architectures Software Developer's Manual says the following about re-ordering of actions by a single processor (Section 8.2.2, "Memory Ordering in P6 and More Recent Processor Families"):

Reads may be reordered with older writes to different locations but not with older writes to the same location.

Then below when discussing points where this is relaxed compared to earlier processors, it says:

Store-buffer forwarding, when a read passes a write to the same memory location.

As far as I can tell, "store-buffer forwarding" isn't precisely defined anywhere (and neither is "pass"). What does it mean for a read to pass a write to the same location here, given that above it says that reads can't be reordered with writes to the same location?

Melena answered 12/6, 2014 at 4:46 Comment(1)
Related: Can a speculatively executed CPU branch contain opcodes that access RAM? describes what a store buffer is and why it exists, separately from its effect on the memory model. (Which for x86 normal loads/stores (not NT) is pretty much program-order + store-buffer with store-forwarding; see Globally Invisible load instructions.)Farce

The naming is a bit awkward. The "forwarding" happens inside a core/logical processor, as follows. If you first do a STORE, it goes into the store buffer, to be flushed to the cache/memory asynchronously. If you then do a LOAD from the same location ON THE SAME PROCESSOR before that value has been flushed, the value from the store buffer is "forwarded" and you get the value that was just stored. The read is "passing" the write in the sense that it completes before the actual write from the store buffer to memory has happened.
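
As a rough sketch (mine, not the manual's), here is that pattern written with C++ relaxed atomics instead of raw machine instructions; the variable names are arbitrary and the comments describe what an x86 core typically does with the resulting plain MOVs:

    #include <atomic>

    std::atomic<int> x{0}, y{0};   // both initially 0, potentially shared with another core

    void on_one_core() {
        x.store(1, std::memory_order_relaxed);       // the store goes into this core's store buffer first
        int r1 = x.load(std::memory_order_relaxed);  // same address: satisfied by forwarding from the
                                                     // store buffer, so r1 == 1 even if the store hasn't
                                                     // reached the cache yet
        int r2 = y.load(std::memory_order_relaxed);  // different address: this read may complete ("pass")
                                                     // before the buffered store to x becomes visible
                                                     // to other cores
        (void)r1; (void)r2;
    }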

The statement isn't actually saying much if you just care about the ordering rules: this forwarding is a detail of what the processor does internally to guarantee that (within a single processor) reads are not reordered with older writes to the same location (part of the rule you quoted).

Despite what some of the other answers here state, there is (at least as far as ordering guarantees go) NO store-buffer forwarding/snooping between processors/cores, as the 8.2.3.5 "Intra-Processor Forwarding Is Allowed" example in the manual shows.

Triny answered 8/9, 2014 at 13:10 Comment(3)
The store buffer is the cause of memory reordering on x86. The memory model is basically program-order plus a store-buffer with store forwarding. The "not reordered with older writes to the same location" phrasing apparently only means that a load can see stores done by the same core. It does not mean anything stronger that you might expect, otherwise a store/reload would effectively be a full memory barrier. But as Can x86 reorder a narrow store with a wider load that fully contains it? shows, that reordering is possible on real CPUs.Farce
See also Globally Invisible load instructions. (And for more about why a store buffer exists in the first place, Can a speculatively executed CPU branch contain opcodes that access RAM?)Farce
re: snooping between cores: indeed, that would violate the total-store-order guarantee. Some PowerPC CPUs do that between logical cores of one physical core, and that's the source of IRIW reordering, where threads can disagree about the order in which two stores happened; see Will two atomic writes to different locations in different threads always be seen in the same order by other threads?Farce

I'd guess that the hang-up is the notion of a "store buffer". The starting point is the great disparity between the speed of a processor core and the speed of memory. A modern core can easily execute a dozen instructions in a nanosecond, but a RAM chip can require 150 nanoseconds to deliver a value stored in memory. That is an enormous mismatch, and modern processors are filled to the brim with tricks to work around the problem.

Reads are the harder problem to solve: a processor stalls and doesn't execute any code while it waits for the memory sub-system to deliver a value. An important sub-unit in a processor is the prefetcher. It tries to predict which memory locations the program is going to load, so it can tell the memory sub-system to read them ahead of time. Physical reads therefore occur much sooner than the logical loads in your program.

Writes are easier: a processor has a buffer for them; model it like a queue in software. The execution engine can quickly drop a store instruction into the queue and doesn't get bogged down waiting for the physical write to occur. This is the store buffer. Physical writes to memory therefore occur much later than the logical stores in your program.

The trouble starts when your program uses more than one thread and the threads access the same memory locations, running on different cores. Ordering now becomes very important: the early reads performed by the prefetcher can return stale values, and the late writes held in the store buffer make it worse yet. Solving this requires synchronization between the threads, which is very expensive; a processor is easily stalled for dozens of nanoseconds waiting for the memory sub-system to catch up. Instead of making your program faster, threads can actually make it slower.

The processor can help; store-buffer forwarding is one such trick. A logical read in one thread can pass a physical write initiated by another thread while that store is still sitting in the buffer and has not been performed yet. Without synchronization in the program, that would always cause the thread to read a stale value. What store-buffer forwarding does is look through the pending stores in the buffer and find the latest write that matches the read address. That "forwards" the store in time, making it look as though it was executed earlier than it actually will be. The thread gets the actual value, the one that will eventually end up in memory, and the read no longer passes the write.
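
To make that lookup concrete, here is a toy software model of a store buffer with forwarding. It is purely illustrative, with invented names: real hardware uses content-addressable logic and has to cope with store sizes, partial overlaps and speculation, and (as the other answer and the comments note) on x86 a load is only ever forwarded from its own core's buffer:

    #include <cstdint>
    #include <deque>
    #include <unordered_map>

    // Toy model of one core's store buffer -- not how hardware is implemented.
    struct ToyStoreBuffer {
        struct Entry { std::uintptr_t addr; std::uint64_t value; };
        std::deque<Entry> pending;                                 // oldest at front, newest at back
        std::unordered_map<std::uintptr_t, std::uint64_t>& memory; // stands in for cache/RAM

        explicit ToyStoreBuffer(std::unordered_map<std::uintptr_t, std::uint64_t>& mem)
            : memory(mem) {}

        void store(std::uintptr_t addr, std::uint64_t value) {
            pending.push_back({addr, value});        // buffered; not yet visible to other cores
        }

        std::uint64_t load(std::uintptr_t addr) {
            // Forwarding: scan newest-to-oldest for a pending store to the same address.
            for (auto it = pending.rbegin(); it != pending.rend(); ++it)
                if (it->addr == addr)
                    return it->value;                // satisfied from the store buffer
            return memory[addr];                     // otherwise read from memory/cache
        }

        void drain_one() {                           // the asynchronous flush, one store at a time
            if (!pending.empty()) {
                memory[pending.front().addr] = pending.front().value;
                pending.pop_front();
            }
        }
    };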

Actually writing a program that takes advantage of store-buffer forwarding is rather inadvisable. Apart from the very iffy timing, such a program will port very, very poorly: Intel processors have a strong memory model with firm ordering guarantees, but you can't ignore the kinds of processors that are popular on mobile devices these days, which consume a lot less power by not providing such guarantees.

And the feature can in fact be quite detrimental: it hides synchronization bugs in your code, and those are the worst possible bugs to diagnose. Microprocessors have been staggeringly successful over the past 30 years; they did not, however, get easier to program.

Zirkle answered 12/6, 2014 at 8:39 Comment(3)
Thanks, that's a nice explanation of store-buffer forwarding. I suppose the important part here is that the read passes in front of the physical write, but not the program order "logical" write. To clarify: are the writing thread and the reading thread running on the same core or different ones? That is to say, can/does one core snoop into the store buffer of another? If you update your answer to address that, I'll mark it as accepted. Thanks again!Melena
Different cores, snooping is real afaik. Hyperthreading and NUMA complicate the story; I don't know enough about it.Zirkle
@Melena - no, on x86 anyway, stores on one logical thread cannot be forwarded to loads from the other logical processor on the same core, since it would violate the x86 memory model. In fact, inter-logical core sharing is quite tricky: stores on one thread will snoop the load buffer of the other thread and if there is a hit, you'll get a "machine clear" which basically nukes the pipeline. That's to avoid another ordering violation because the threads share an L1 (so MESI is out of the picture and you need another mechanism).Groundhog

Section 8.2.3.5, "Intra-Processor Forwarding Is Allowed", gives an example of store-buffer forwarding:

Initially x = y = 0

    Processor 0             Processor 1
   ==============          =============
    mov [x], 1              mov [y], 1
    mov r1, [x]             mov r3, [y]
    mov r2, [y]             mov r4, [x]

The result r2 == 0 and r4 == 0 is allowed.

... the reordering in this example can arise as a result of store-buffer forwarding. While a store is temporarily held in a processor's store buffer, it can satisfy the processor's own loads but is not visible to (and cannot satisfy) loads by other processors.
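
Here is a runnable sketch of that litmus test in C++ (assuming an x86-64 machine with at least two cores). With relaxed atomics the compiler emits plain MOVs for these accesses, so any r2 == 0 && r4 == 0 outcome you see comes from the store buffer, as the manual describes; whether a particular run catches it is timing-dependent, and a real litmus harness would pin and tightly synchronize the threads:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};

    int main() {
        for (int i = 0; i < 200000; ++i) {
            x.store(0, std::memory_order_relaxed);
            y.store(0, std::memory_order_relaxed);
            int r2 = -1, r4 = -1;

            std::thread p0([&] {
                x.store(1, std::memory_order_relaxed);    // mov [x], 1
                (void)x.load(std::memory_order_relaxed);  // mov r1, [x]  (forwarded, always 1)
                r2 = y.load(std::memory_order_relaxed);   // mov r2, [y]
            });
            std::thread p1([&] {
                y.store(1, std::memory_order_relaxed);    // mov [y], 1
                (void)y.load(std::memory_order_relaxed);  // mov r3, [y]  (forwarded, always 1)
                r4 = x.load(std::memory_order_relaxed);   // mov r4, [x]
            });
            p0.join();
            p1.join();

            if (r2 == 0 && r4 == 0) {                     // the outcome the manual says is allowed
                std::printf("r2 == 0 && r4 == 0 observed on iteration %d\n", i);
                return 0;
            }
        }
        std::printf("not observed in this run\n");
        return 0;
    }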

The statement that says reads can't be reordered with writes to the same location ("Reads may be reordered with older writes to different locations but not with older writes to the same location") is in a section that applies to "a single-processor system for memory regions defined as write-back cacheable". The "store-buffer forwarding" relaxation is about multi-processor behavior only: it concerns what loads on other processors can observe while a store is still buffered.

Bother answered 12/6, 2014 at 5:30 Comment(3)
I've seen that, and the example works totally as I would expect. But I don't see how it demonstrates "a read pass[ing] a write to the same memory location". In this case the read and the write are concurrent -- they have no defined ordering to begin with. I don't see the sense in which one is passing the other.Melena
@jacobsa: consider loading r2. From the point of view of Processor 0, it has to occur after the write to x. Similarly on Processor 1, the load of r4 has to occur after the write to y. If you don't permit store forwarding, then if P0 reads y as 0, all three of P0's instructions would have had to execute before P1 performed its first instruction, so P1 would have to read 1 out of x. Similar logic applies if you consider P1 reading a 0 from location x when reordering isn't permitted.Bother
Thanks. I totally understand the example and its consequences. I guess I'm just caught up on wording, but I still don't see where a read "passes" a write to the same memory location. Which memory location in this example, and in what sense did a read start on one side of a write and migrate to the other side? They began unordered (since they're on different processors), as far as I can tell.Melena
