Can out-of-order execution lead to speculative memory accesses?
When an out-of-order processor encounters something like

LOAD R1, 0x1337
LOAD R2, $R1
LOAD R3, 0x42

Assuming that all accesses result in cache misses, can the processor ask the memory controller for the contents of 0x42 before it asks for the contents of $R1, or even of 0x1337? If so, and assuming that accessing $R1 results in an exception (e.g., a segmentation fault), we can consider that 0x42 was loaded speculatively, correct?

And by the way, when a load-store unit sends a request to the memory controller, can it send a second request before receiving the answer to the previous one?

My question doesn't target any architecture in particular; answers related to any mainstream architecture are welcome.
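The load chain in the question can be sketched in plain C (the variable names standing in for the addresses are invented for illustration). The point is the dependency structure: the second load cannot even be issued until the first produces an address, while the third has no dependency at all, so an out-of-order core is free to issue it first.

```c
#include <assert.h>

/* Sketch of the question's load chain. mem_1337 stands in for address
   0x1337 and mem_42 for 0x42 (made-up names). Loads 1 and 2 form a
   pointer-chase dependency; load 3 is independent of both. */
static int mem_42 = 7;              /* contents of "0x42"              */
static int mem_1337 = 0;            /* contents of "0x1337"            */
static int *table[1] = { &mem_42 }; /* indirection for "LOAD R2, $R1"  */

int load_chain(void) {
    int r1 = mem_1337;   /* LOAD R1, 0x1337                            */
    int r2 = *table[r1]; /* LOAD R2, $R1  - must wait for r1's value   */
    int r3 = mem_42;     /* LOAD R3, 0x42 - no dependency on r1 or r2  */
    return r2 + r3;
}
```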

Freed answered 20/9, 2012 at 12:47 Comment(0)

The answer to your question depends on the memory ordering model of your CPU, which is not the same thing as the CPU allowing out-of-order execution. If the CPU implements total store ordering (TSO, e.g. x86 or SPARC), then the answer to your question is that 0x42 will not be loaded before 0x1337.

If the CPU implements a relaxed memory model (e.g. IA-64, PowerPC, Alpha), then in the absence of a memory-fence instruction all bets are off as to which will be accessed first. This should be of little relevance unless you are doing I/O or dealing with multi-threaded code.

You should note that some CPUs (e.g. Itanium) do have relaxed memory models (so reads may be out of order) but do NOT have any out-of-order execution logic, since they expect the compiler to order instructions, speculative ones included, in an optimal way rather than spend silicon on OoO execution.
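On a relaxed-memory machine, the fence the answer mentions is what C11 release/acquire atomics compile down to. A minimal sketch (names are illustrative, not from the answer): on PowerPC or Alpha the release/acquire pair emits a real hardware barrier; on a TSO machine like x86 it is essentially free.

```c
#include <stdatomic.h>
#include <assert.h>

static int payload = 0;           /* plain data, published via `ready` */
static _Atomic int ready = 0;

void publish(void) {
    payload = 42;                                            /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* fence+store */
}

int consume(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire))  /* load+fence  */
        return payload;  /* acquire pairing guarantees we observe 42 */
    return -1;           /* flag not yet set */
}
```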

Angelika answered 7/10, 2012 at 22:52 Comment(7)
Is this still true with today's NUMA x86? I can't think of a particularly efficient way to enforce write ordering across different memory controllers.Alwitt
Yes. It's true because x86 CPUs have a cache coherency protocol, as explained in detail in the Intel developer's manual. Intel can't change this without breaking binary compatibility with existing software (including my own). www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf or other Intel documents. It's also why, if you have different CPUs access different memory locations that are next to each other and hence on the same cache line, performance is likely to be ghastly.Angelika
Umm... I can assure you that Intel x86 processors perform speculative memory accesses. Since P6 circa 1996. However, they only perform such speculative memory accesses to memory locations that the OS has indicated are ordinary memory (WB), not uncached memory that may have side effects (UC), marked using the MTRRs. And processors that do such speculation typically have logic to detect violations of the memory ordering model. // I.e. Intel x86 does speculative loads, but detects violations so that most programmers won't notice. But some low level programmers may notice.Chaisson
@Krazy Glew x86 CPUs can indeed prefetch whatever they like into the cache any time they like. The question states cache misses, though, and the CPU can't reorder reads from memory (or cache) into registers, since if it did you would have a memory-ordering violation if another core in the system wrote to one of these locations. Any other CPU in the system can use 0x42 to indicate that it has finished some result stored in 0x1337 in this example, a fact which the programmer can use immediately, or after executing 100,000 more instructions. (This is in total contrast to RMO.)Angelika
Looks like today this "feature" can be exploited, due to the way virtual memory is implemented in Intel P6 and newer processors: news.ycombinator.com/item?id=16046636Jevons
@camelccc: Krazy Glew is talking about Why flush the pipeline for Memory Order Violation caused by other logical processors? not just HW prefetch. The CPU truly does reorder load execution, but then checks if the eventual result was legal according to the ISA's on-paper memory-model guarantees. (i.e. that the cache line loaded from earlier is still valid and thus still contains the data we're now allowed to load). If not, nuke the in-flight instructions that depended on this possibly-unsafe speculation and roll back to a known safe state.Conics
So you get the perf of relaxed load ordering (most of the time) while still maintaining the memory-model rules where every load is effectively an acquire load.Conics

This would seem to be a logical conclusion for superscalar CPUs with multiple load-store units too. Multi-channel memory controllers are pretty common these days.

In the case of out-of-order instruction execution, an enormous amount of logic is expended in determining whether instructions have dependencies on others in the stream - not just register dependencies but also operations on memory. There's also an enormous amount of logic for handling exceptions: the CPU needs to complete all instructions in the stream up to the fault (or, alternatively, offload some of this onto the operating system).

In terms of the programming model seen by most applications, the effects are never apparent. It's implicit that loads will not always happen in the expected sequence as seen by memory - but this is the case anyway when caches are in use.

Clearly, in circumstances where the order of loads and stores does matter - for instance when accessing device registers - OoO execution must be disabled. The POWER architecture has the wonderful EIEIO instruction for this purpose.
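The device-register case above can be sketched as follows. The register layout and barrier macro here are invented for illustration: on POWER the barrier would be `eieio`, on ARM `dmb`; a GCC-style compiler barrier stands in so the sketch compiles anywhere.

```c
#include <stdint.h>
#include <assert.h>

/* Hypothetical two-register device: write the payload, then ring the
   doorbell. The stores must reach the device in that order. */
#define IO_BARRIER() __asm__ __volatile__("" ::: "memory")

struct fake_dev {
    volatile uint32_t data; /* payload register  */
    volatile uint32_t go;   /* doorbell register */
};

static struct fake_dev dev;

void kick_device(uint32_t payload) {
    dev.data = payload; /* must be visible to the device first      */
    IO_BARRIER();       /* real code: eieio (POWER) / dmb (ARM) etc. */
    dev.go = 1;         /* then ring the doorbell                   */
}
```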

Some members of the ARM Cortex-A family offer OoO execution. Given the power constraints of these devices, and the apparent lack of instructions for forcing ordering, I suspect that load-stores always complete in order.

Fortunate answered 20/9, 2012 at 19:8 Comment(4)
Marko, thank you for your answer. As you say, I'm stating a logical assumption for OoO/superscalar CPUs, but I really want to know whether it is actually true or not :)Hotblooded
The ARM (v6?) Architecture Reference Manual refers to something called DataMemoryBarrier (perhaps a pseudo-instruction or macro?) which is actually a write to CP15. I'm not sure if it's privileged (it's next to things like cache disabling, which ought to be privileged), but it's there.Alwitt
The dmb (pseudo)instruction is a memory barrier (aka memory fence). It's the only such instruction provided in ARMv7 (other architectures provide much more specific load and store fences). This does indeed force completion ordering, but also protects against other hazards where hardware (or a thread running on another core) is relying on the effects of stores becoming visible. It's a non-privileged instruction, and there are a few scenarios in user-space code where you need it - such as implementing atomic operations.Fortunate
ARM also provides ISB and DSB; you can have a look at the ARM ARM for their use cases and their effects on the pipeline and system buses.Bonds

Related for x86: Why flush the pipeline for Memory Order Violation caused by other logical processors?. The observable result will obey x86 ordering rules, but microarchitecturally yes it can load early. (And of course that's from cache; HW prefetch is different).

OoO exec CPUs truly do reorder load execution if the address isn't ready for one load. Or if it misses in cache, then later loads can run before data arrives for this one. But on x86, to maintain correctness wrt. the strong memory model (program order + a store buffer with store forwarding), the core checks if the eventual result was legal according to the ISA's on-paper memory-model guarantees. (i.e. that the cache line loaded from earlier is still valid and thus still contains the data we're now allowed to load). If not, nuke the in-flight instructions that depended on this possibly-unsafe speculation and roll back to a known safe state.

So modern x86 gets the perf of relaxed load ordering (most of the time) while still maintaining the memory-model rules where every load is effectively an acquire load. But at the cost of pipeline nukes if you do something the pipeline doesn't like, e.g. false sharing (which is already bad enough).
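The usual fix for the false sharing mentioned above is to pad hot per-thread data onto separate cache lines, so stores from different cores never hit the same line and never trigger these nukes. A minimal sketch, assuming a 64-byte line size (the field names are invented):

```c
#include <stddef.h>
#include <stdalign.h>
#include <assert.h>

/* Two counters, each owned by one thread. alignas(64) forces each onto
   its own (assumed 64-byte) cache line, so a store by thread A never
   invalidates the line thread B is reading. */
struct counters {
    alignas(64) long hits_a; /* touched only by thread A */
    alignas(64) long hits_b; /* touched only by thread B */
};
```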

Other CPUs with a strong memory model (SPARC TSO) might not be this aggressive. Weak memory models allow later loads to complete early.

Of course this is reading from cache; demand-load requests are seen by the (or a) memory controller only on a cache miss. But HW prefetchers can access memory asynchronously from the CPU; that's how they get data into cache ahead of when the CPU runs an instruction that loads it, ideally avoiding a cache miss altogether.
And yes, the memory subsystem is pipelined, like 12 to 16 outstanding requests per core in Skylake. (12 LFBs for L1<->L2, and IIRC 16 superqueue entries in the L2.)

Conics answered 29/12, 2020 at 12:19 Comment(1)
related: preshing.com/20120710/… describes some mechanism by which CPUs can end up reordering their accesses to coherent shared cache. See also How is load->store reordering possible with in-order commit? for one of the more surprising ones, and How does Intel X86 implements total order over storesConics

A compliant SPARC processor must implement TSO but may also implement RMO and PSO. You need to know what mode your OS is running in unless you happen to know that your specific hardware platform has not implemented RMO and PSO.

Glycoside answered 8/2, 2013 at 7:23 Comment(2)
Thank you, Spark Ler. I was aware of that fact, but it's good to have it here for the sake of completeness.Hotblooded
True, BUT all versions of Solaris run in TSO only. The UltraSPARC III and later only implement TSO, with the result that I'm pretty sure Linux dropped RMO and PSO support years ago. If anyone knows of a SPARC configuration that supports RMO, I'd like to know about it.Angelika
