Out of Order Execution and Memory Fences

I know that modern CPUs can execute instructions out of order; however, they always retire the results in order, as described by Wikipedia:

"Out of Order processors fill these "slots" in time with other instructions that are ready, then re-order the results at the end to make it appear that the instructions were processed as normal."

Now, memory fences are said to be required on multicore platforms because, owing to out-of-order execution, the wrong value of x can be printed here:

Processor #1:
 while f == 0
  ;
 print x; // x might not be 42 here

Processor #2:
 x = 42;
 // Memory fence required here
 f = 1

Now my question is: since out-of-order processors (cores, in the case of multicore processors, I assume) always retire the results in order, what is the necessity of memory fences? Do the cores of a multicore processor see only results retired from other cores, or do they also see results that are still in flight?

I mean, in the example above, when Processor #2 eventually retires the results, the result of x should come before f, right? I know that during out-of-order execution it might have modified f before x, but it must not have retired f before x, right?

Now, with in-order retirement of results and a cache-coherence mechanism in place, why would you ever need memory fences on x86?

They answered 8/9, 2011 at 10:52 Comment(1)
Note that memory fences always come in pairs in correct code: When two threads communicate, each thread has to perform some ordering of memory accesses (= fences). Usually, one of these fences has release semantics, the other has acquire semantics. In your pseudocode, Processor #2 should execute a write fence between the assignments (release semantics), and Processor #1 should add a read fence (acquire semantics) between the loop and the print. Some fences may be unnecessary on specific platforms, but any source code should contain both fences (which may compile to noops).Milled
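The paired-fences pattern described in the comment above can be sketched with C++11 atomics (a minimal illustration, not from the original question; the variable names just mirror the pseudocode, and the release/acquire orderings play the role of the write and read fences):

```cpp
#include <atomic>
#include <thread>

// Shared state mirroring the question's pseudocode:
// 'x' is plain data, 'f' is the synchronization flag.
int x = 0;
std::atomic<int> f{0};

void processor2() {
    x = 42;                                 // plain store
    f.store(1, std::memory_order_release);  // release: nothing before this
                                            // store may be reordered after it
}

int processor1() {
    while (f.load(std::memory_order_acquire) == 0)  // acquire on each load
        ;                                           // spin until flag is set
    return x;  // acquire pairing with the release guarantees this reads 42
}

int run_once() {
    x = 0;
    f.store(0, std::memory_order_relaxed);
    int seen = 0;
    std::thread reader([&] { seen = processor1(); });
    std::thread writer(processor2);
    reader.join();
    writer.join();
    return seen;
}
```

On x86 both orderings compile to plain loads and stores (the hardware already provides them); the fences still matter because they also forbid the compiler from reordering the accesses.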

This tutorial explains the issues: http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-95-7.pdf

FWIW, where memory ordering issues happen on modern x86 processors, the reason is that while the x86 memory consistency model offers quite strong consistency, explicit barriers are still needed to prevent a later load from being reordered ahead of an earlier store (StoreLoad reordering). This is due to something called the "store buffer".

That is, x86 is sequentially consistent (nice and easy to reason about) except that loads may be reordered wrt earlier stores. That is, if the processor executes the sequence

store x
load y

then on the processor bus this may be seen as

load y
store x

The reason for this behavior is the aforementioned store buffer, which is a small buffer for writes before they go out on the system bus. Load latency is, OTOH, a critical issue for performance, and hence loads are permitted to "jump the queue".

See Section 8.2 in http://download.intel.com/design/processor/manuals/253668.pdf
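The store/load reordering above is exactly the Dekker-style litmus test. A hedged sketch in C++ (my illustration, not from the answer): with weaker orderings both threads can observe 0, because each store sits in its core's store buffer while the load jumps the queue; `memory_order_seq_cst` makes the compiler emit a full barrier (mfence or a locked instruction) after each store, which forbids that outcome.

```cpp
#include <atomic>
#include <thread>

// Dekker-style StoreLoad litmus test.
std::atomic<int> X{0}, Y{0};
int r1 = 0, r2 = 0;

bool run_litmus() {
    X.store(0); Y.store(0);
    std::thread a([] {
        X.store(1, std::memory_order_seq_cst);   // store X
        r1 = Y.load(std::memory_order_seq_cst);  // then load Y
    });
    std::thread b([] {
        Y.store(1, std::memory_order_seq_cst);   // store Y
        r2 = X.load(std::memory_order_seq_cst);  // then load X
    });
    a.join();
    b.join();
    return r1 == 1 || r2 == 1;  // seq_cst forbids r1 == 0 && r2 == 0
}
```

With `memory_order_relaxed` (or even acquire/release) in place of `seq_cst`, the `r1 == r2 == 0` outcome becomes legal and is routinely observable on real x86 hardware.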

Raddled answered 8/9, 2011 at 11:0 Comment(7)
Janneb, Can you please explain store buffer a little bit and why are they important in this context?They
Doesn't cache coherence make sure that there is read-after-write consistency in x86?They
@MetallicPriest: Ah, on second though, I suspect barriers are not actually needed in your specific example. I edited the post to reflect this, and also added an explanation of the permitted reordering in the x86 memory model.Raddled
@janneb, he took the example from the wikipedia article on memory barriers.Sayles
@Tony The Tiger: The point is that the x86 memory model does not allow writes to be reordered wrt other writes, hence the barrier is not necessary on x86.Raddled
Minus one for FWIW and OTOH.Semiautomatic
I wrote an answer that explains what store buffers are for in CPU-architecture terms. Also How does memory reordering help processors and compilers? explains why allowing StoreLoad reordering in hardware is essential for performance.Band

The memory fence ensures that all changes to variables before the fence are visible to all other cores, so that all cores have an up-to-date view of the data.

If you don't put a memory fence, the cores might be working with stale data; this can be seen especially in scenarios where multiple cores work on the same dataset. With a fence you can ensure that when CPU 0 has done some action, all changes made to the dataset are visible to all other cores, which can then work with up-to-date information.

Some architectures, including the ubiquitous x86/x64, provide several memory barrier instructions including an instruction sometimes called "full fence". A full fence ensures that all load and store operations prior to the fence will have been committed prior to any loads and stores issued following the fence.
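In C++ the "full fence" described above can be written as a standalone `std::atomic_thread_fence(std::memory_order_seq_cst)`, which on x86 typically compiles to mfence. A minimal sketch (my own example, not from the answer; names are made up) of the question's scenario using explicit fences instead of ordering attached to the individual operations:

```cpp
#include <atomic>
#include <thread>

// Full fences: no load or store may cross the fence in either direction.
std::atomic<int> data{0}, flag{0};

void publisher() {
    data.store(42, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
    flag.store(1, std::memory_order_relaxed);
}

int consumer() {
    while (flag.load(std::memory_order_relaxed) == 0)
        ;  // spin until the flag is published
    std::atomic_thread_fence(std::memory_order_seq_cst);  // full fence
    return data.load(std::memory_order_relaxed);  // guaranteed to read 42
}

int run_pair() {
    data.store(0);
    flag.store(0);
    int got = 0;
    std::thread c([&] { got = consumer(); });
    std::thread p(publisher);
    c.join();
    p.join();
    return got;
}
```

A full fence is stronger than this particular example needs; release/acquire fences would suffice here, but the seq_cst version is the one that maps to the "full fence" instruction the paragraph mentions.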

If a core were to start working with outdated data from the dataset, how could it ever get correct results? It couldn't, regardless of whether the end result is presented as if everything had been done in the right order.

The key is in the store buffer, which sits between the cache and the CPU, and does this:

Store buffer invisible to remote CPUs

Store buffer allows writes to memory and/or caches to be saved to optimize interconnect accesses

That means that stores are first written to this buffer and only later drained to the cache. So the cache could contain a view of the data that is not the most recent, and therefore another CPU, even with cache coherency, will also not have the latest data. A store-buffer flush is necessary for the latest data to be visible; this, I think, is essentially what the memory fence causes to happen at the hardware level.

EDIT:

For the code you used as an example, Wikipedia says this:

A memory barrier can be inserted before processor #2's assignment to f to ensure that the new value of x is visible to other processors at or prior to the change in the value of f.

Sayles answered 8/9, 2011 at 11:1 Comment(0)

Just to make explicit what is implicit in the previous answers, this is correct, but is distinct from memory accesses:

CPUs can execute out of order, However they always retire the results in-order

Retirement of the instruction is separate from performing the memory access; the memory access may complete at a different time from instruction retirement.

Each core will act as if its own memory accesses occur at retirement, but other cores may see those accesses at different times.

(On x86 and ARM, I think only stores are observably subject to this, but e.g., Alpha may load an old value from memory. x86 SSE2 has instructions with weaker guarantees than normal x86 behaviour.)

PS. From memory, the abandoned Sparc ROCK could in fact retire out of order; it spent power and transistors determining when this was harmless. It was abandoned because of power consumption and transistor count... I don't believe any general-purpose CPU has been brought to market with out-of-order retirement.

Archimandrite answered 27/12, 2017 at 8:32 Comment(2)
There have been theoretical proposals for out-of-order retirement to make it possible to hide memory latency with a 1k-instruction out-of-order window without just scaling up a normal ROB to an impractical 1k entries. Specifically, the kilo-instruction processor. Google found this link to the paper on some random site: cgi.di.uoa.gr/~halatsis/Advanced_Comp_Arch/…. And also csl.cornell.edu/~martinez/doc/taco04.pdf.Band
And BTW, a single core sees its own memory accesses happen in order, but they don't have to wait for retirement. Store-forwarding makes it possible for a load to access recently-stored data without waiting for the store to retire and (at some point after that) commit to L1D cache. blog.stuffedcow.net/2014/01/x86-memory-disambiguationBand

© 2022 - 2024 — McMap. All rights reserved.