Out-of-order execution vs. speculative execution
I have read the Wikipedia pages about out-of-order execution and speculative execution.

What I fail to understand, though, are the similarities and differences. It seems to me that speculative execution uses out-of-order execution, for example when it has not yet determined the value of a condition.

The confusion came when I read the Meltdown and Spectre papers and did additional research. The Meltdown paper states that Meltdown is based on out-of-order execution, while some other resources, including the Wikipedia page about speculative execution, state that Meltdown is based on speculative execution.

I'd like to get some clarification about this.

Klute answered 1/4, 2018 at 19:33 Comment(1)
To me, out-of-order execution is a form of speculative execution (speculating that the in-flight insts won't have side effects relevant to the new insts). Another form of spec exec is branch prediction, another is walking the page tables ahead of time. Technically, OOO is the ability to move execution between different dependency chains, so skipping an inst and moving to the next one; but this is betting that the older insts won't fault, for example, thus a speculation.Inna

Speculative execution and out-of-order execution are orthogonal. One could design a processor that is OoO but not speculative or speculative but in-order. OoO execution is an execution model in which instructions can be dispatched to execution units in an order that is potentially different from the program order. However, the instructions are still retired in program order so that the program's observed behavior is the same as the one intuitively expected by the programmer. (Although it's possible to design an OoO processor that retires instructions in some unnatural order with certain constraints. See the simulation-based study on this idea: Maximizing Limited Resources: a Limit-Based Study and Taxonomy of Out-of-Order Commit).
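As a concrete illustration, here is a minimal C sketch (the function and names are hypothetical, not from any of the cited papers) of two independent dependency chains that an OoO core can overlap even though program order lists them sequentially:

    #include <stdint.h>

    /* The load feeding 'a' may miss in cache. An OoO core can dispatch and
     * execute the ALU chain for 'b' while that load is still outstanding,
     * then retire everything in program order so the result is the same as
     * strictly sequential execution. */
    uint64_t overlap(const uint64_t *p, uint64_t x)
    {
        uint64_t a = p[0] * 3;   /* chain 1: load (possibly slow), then multiply */
        uint64_t b = x * x + 7;  /* chain 2: pure register/ALU work, independent */
        return a + b;            /* join point: needs both chains                */
    }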

Speculative execution is an execution model in which instructions can be fetched and enter the pipeline and begin execution without knowing for sure that they will indeed be required to execute (according to the control flow of the program). The term is often used to specifically refer to speculative execution in the execution stage of the pipeline. The Meltdown paper does define these terms on page 3:

In this paper, we refer to speculative execution in a more restricted meaning, where it refers to an instruction sequence following a branch, and use the term out-of-order execution to refer to any way of getting an operation executed before the processor has committed the results of all prior instructions.

The authors here specifically refer to branch prediction combined with executing instructions past predicted branches in the execution units. This is commonly the intended meaning of the term. However, it's also possible to design a processor that executes instructions speculatively without any branch prediction, by using other techniques such as value prediction and speculative memory disambiguation. That would be speculation on data or memory dependencies rather than on control: an instruction could be dispatched to an execution unit with an incorrect operand, or one that loads the wrong value. Speculation can also occur on the availability of execution resources, on the latency of an earlier instruction, or on the presence of a needed value in a particular unit in the memory hierarchy.
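As an illustration of speculating on a memory dependency rather than on control flow, here is a minimal C sketch (hypothetical function, assuming a core with speculative memory disambiguation):

    /* At dispatch time the core may not know whether 'dst' and 'src' alias.
     * A core with speculative memory disambiguation can execute the load of
     * *src before the store's address is resolved, betting they don't
     * overlap; if they do overlap, the load must be replayed with the
     * store's data instead. */
    int store_then_load(int *dst, const int *src)
    {
        *dst = 42;       /* store whose address may resolve late            */
        return *src;     /* load that may be speculatively hoisted above it */
    }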

Note that instructions can be executed speculatively, yet in-order. When the decoding stage of the pipeline identifies a conditional branch instruction, it can speculate on the branch and its target and fetch instructions from the predicted target location, while still executing them in order. However, once the speculated conditional branch instruction and the instructions fetched from the predicted path (or both paths) reach the issue stage, none of them will be issued until all earlier instructions have been issued. The Intel Bonnell microarchitecture is an example of a real processor that is in-order and supports branch prediction.

Processors designed to carry out simple tasks and used in embedded systems or IoT devices are typically neither speculative nor OoO. Desktop and server processors are both speculative and OoO. Speculative execution is particularly beneficial when used with OoO.

The confusion came when I read the Meltdown and Spectre papers and did additional research. The Meltdown paper states that Meltdown is based on out-of-order execution, while some other resources, including the Wikipedia page about speculative execution, state that Meltdown is based on speculative execution.

The Meltdown vulnerability as described in the paper requires both speculative and out-of-order execution. However, this is a somewhat vague statement, since there are many different speculative and out-of-order execution implementations. Meltdown doesn't work with just any type of OoO or speculative execution. For example, the ARM11 (used in Raspberry Pis) supports some limited OoO and speculative execution, but it's not vulnerable.

See Peter's answer below for more details on Meltdown, as well as his other answer.

Related: What is the difference between Superscalar and OoO execution?.

Commercialize answered 1/4, 2018 at 22:35 Comment(3)
How does OoO exec without speculation work? Instructions need to wait for earlier independent loads/stores to be known non-faulting, even if the data isn't ready (e.g. wait for TLB hit, but not for cache miss)? ARM division instructions don't fault even on divide by zero IIRC, so at least it wouldn't have to stall for that. (I think we discussed this at some point, but I forget what your answer was.)Sandasandakan
Also, note that with an in-order pipeline, speculative fetch / decode doesn't let the speculation reach the exec stage, so instructions after a mispredicted branch never actually get executed. Seems a bit optimistic to call this speculative execution. (Unless you mean that execution of a branch instruction can start in-order, but not complete for a long time so some later instructions get a chance to execute)Sandasandakan
(Update: ok yeah, that edit makes more sense to me, no longer suggesting non-speculative OoO exec.)Sandasandakan

I'm still having a hard time figuring out how Meltdown uses speculative execution. The example in the paper (the same one I mentioned here earlier) uses IMO only OoO - @Name in a comment

Meltdown is based on Intel CPUs optimistically speculating that loads won't fault, and that if a faulting load does reach the load ports, it was the result of an earlier mispredicted branch. So the load uop gets marked so it will fault if it reaches retirement, but execution continues speculatively using data that the page-table entry says you aren't allowed to read from user-space.

Instead of triggering a costly exception-recovery when the load executes, it waits until it definitely reaches retirement, because that's a cheap way for the machinery to handle the branch miss -> bad load case. In hardware, it's easier for the pipe to keep piping unless you need it to stop / stall for correctness. E.g. a load where there's no page-table entry at all, and thus a TLB miss, has to wait. But waiting even on a TLB hit (for an entry with permissions that block using it) would be added complexity. Normally a page-fault is only ever raised after a failed page walk (which doesn't find an entry for the virtual address), or at retirement of a load or store that failed the permissions of the TLB entry it hit.

In a modern OoO pipelined CPU, all instructions are treated as speculative until retirement. Only at retirement do instructions become non-speculative. The Out-of-Order machinery doesn't really know or care whether it's speculating down one side of a branch that was predicted but not executed yet, or speculating past potentially-faulting loads. "Speculating" that loads don't fault or ALU instructions don't raise exceptions happens even in CPUs that aren't really considered speculative, but fully out-of-order execution turns that into just another kind of speculation.

I'm not too worried about an exact definition for "speculative execution", and what counts / what doesn't. I'm more interested in how modern out-of-order designs actually work, and that it's actually simpler to not even try to distinguish speculative from non-speculative until the end of the pipeline. This answer isn't even trying to address simpler in-order pipelines with speculative instruction-fetch (based on branch prediction) but not execution, or anywhere in between that and full-blown Tomasulo's algorithm with a ROB + scheduler with OoO exec + in-order retirement for precise exceptions.

For example, only after retirement can a store ever commit from the store buffer to L1d cache, not before. And to absorb short bursts of stores and cache misses, commit doesn't have to happen as part of retirement either. So one of the only non-speculative out-of-order things is committing stores to L1d: they have definitely happened as far as the architectural state is concerned, so they have to be completed even if an interrupt / exception happens.
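A minimal C sketch of the guarantee that implies (hypothetical pointers; this is about ordering, not any particular API):

    /* If the load faults, the earlier store has already retired, so it is
     * architecturally done: it must still drain from the store buffer into
     * L1d even while the exception is handled. A store younger than the
     * faulting load, by contrast, can never have committed, because commit
     * only happens after retirement. */
    int retired_store(int *ok_ptr, const int *bad_ptr)
    {
        *ok_ptr = 1;         /* retires, then commits from the store buffer to L1d */
        return *bad_ptr;     /* suppose this faults (e.g. unmapped page)           */
    }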

The fault-if-reaching-retirement mechanism is a good way to avoid expensive work in the shadow of a branch mispredict. It also gives the CPU the right architectural state (register values, etc.) if the exception does fire. You do need that whether or not you let the OoO machinery keep churning on instructions beyond a point where you've detected an exception.


Branch-misses are special: there are buffers that record micro-architectural state (like register-allocation) on branches, so branch-recovery can roll back to that instead of flushing the pipeline and restarting from the last known-good retirement state. Branches do mispredict a fair amount in real code. Other exceptions are very rare.

Modern high-performance CPUs can keep (out-of-order) executing uops from before a branch miss, while discarding uops and execution results from after that point. Fast recovery is a lot cheaper than discarding and restarting everything from a retirement state that's potentially far behind the point where the mispredict was discovered.

E.g. in a loop, the instructions that handle the loop counter might get far ahead of the rest of the loop body, and detect the mispredict at the end soon enough to redirect the front-end and maybe not lose much real throughput, especially if the bottleneck was the latency of a dependency chain or something other than uop throughput.
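A minimal C sketch of that situation (hypothetical function): the loop-control chain depends only on the counter, so the compare-and-branch work can run far ahead of the latency-bound sum chain and discover the final loop-exit mispredict early:

    /* The 'i < n' compare/branch chain depends only on 'i', so an OoO core
     * can execute loop-control uops for many iterations ahead, and detect
     * the mispredicted loop exit, while the floating-point adds (a
     * latency-bound chain through 'sum') are still in flight. */
    double sum_array(const double *a, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }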

This optimized recovery mechanism is only used for branches (because the state-snapshot buffers are limited), which is why branch misses are relatively cheap compared to full pipeline flushes. (e.g. on Intel, memory-ordering machine clears, performance counter machine_clears.memory_ordering: What are the latency and throughput costs of producer-consumer sharing of a memory location between hyper-siblings versus non-hyper siblings?)


Exceptions are not unheard-of, though; page-faults do happen in the normal course of operation. e.g. a store to a read-only page triggers copy-on-write, and a load or store to an unmapped page triggers page-in or handling of the lazy mapping. But thousands to millions of instructions usually run between page faults, even in a process that's allocating new memory frequently (roughly one per microsecond to millisecond on a 1 GHz CPU). In code that doesn't map new memory, you can go far longer without exceptions: mostly just an occasional timer interrupt in pure number crunching without I/O.

But anyway, you don't want to trigger a pipeline flush or anything expensive until you're sure that an exception will really fire. And that you're sure you have the right exception. e.g. maybe the load address for an earlier faulting load wasn't ready as soon, so the first faulting load to execute wasn't the first in program order. Waiting until retirement is a cheap way to get precise exceptions. Cheap in terms of additional transistors to handle this case, and letting the usual in-order retirement machinery figure out exactly which exception fires is fast.
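A minimal C sketch of that ordering problem (hypothetical pointers; imagine 'idx' is produced by a long dependency chain, so the first load's address is ready late):

    /* Suppose both loads fault. If idx arrives late, the younger load of
     * q[0] executes (and detects its fault) first, but the architecturally
     * correct exception is the one for p[idx], the older load in program
     * order. Letting in-order retirement pick the exception gets this right
     * without extra logic in the out-of-order core. */
    int two_faulting_loads(const int *p, const int *q, long idx)
    {
        int x = p[idx];   /* older in program order, address ready late */
        int y = q[0];     /* younger, may execute and fault first       */
        return x + y;
    }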

The useless work done executing instructions after an instruction marked to fault on retirement costs a tiny bit of power, and isn't worth blocking because exceptions are so rare.

This explains why it makes sense to design hardware that was vulnerable to Meltdown in the first place. Obviously it's not safe to keep doing this, now that Meltdown has been thought of.


Fixing Meltdown cheaply

We don't need to block speculative execution after a faulting load; we just need to make sure it doesn't actually use sensitive data. It's not the load succeeding speculatively that's the problem; Meltdown is based on the following instructions using that data to produce data-dependent microarchitectural effects (e.g. touching a cache line based on the data), as in the sketch below.
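A heavily simplified C-style sketch of that transient sequence (the names 'probe' and 'kernel_addr' are hypothetical; the real attack in the paper uses asm, exception suppression/handling, and a cache-timing pass to read the footprint back out):

    #include <stdint.h>
    #include <stddef.h>

    extern volatile uint8_t probe[256 * 4096];   /* user-readable array, one page per byte value */

    /* Architecturally, the first load faults and 'secret' never becomes
     * visible. Microarchitecturally, on vulnerable CPUs, the dependent
     * access may execute before retirement and pull probe[secret * 4096]
     * into the cache: a data-dependent footprint that later non-speculative
     * timing measurements can observe. */
    void transient_sequence(const volatile uint8_t *kernel_addr)
    {
        uint8_t secret = *kernel_addr;              /* marked to fault at retirement         */
        (void)probe[(size_t)secret * 4096];         /* touches a secret-dependent cache line */
    }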

So if the load ports mask the loaded data to zero or something as well as setting the fault-on-retirement flag, execution continues but can't gain any info about the secret data. This should take about 1 extra gate delay of critical path, which is probably possible in the load ports without limiting the clock speed or adding an extra cycle of latency. (1 clock cycle is long enough for logic to propagate through many AND/OR gates within a pipeline stage, e.g. a full 64-bit adder).
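In C-like pseudologic, the proposed fix amounts to one extra mux/AND level in the load port's data path (a sketch of the idea, not of how any specific CPU implements it):

    #include <stdint.h>
    #include <stdbool.h>

    /* Forward zero instead of the real data whenever the TLB permission
     * check fails, and still mark the uop to fault at retirement. Dependent
     * uops keep executing, but they can only ever see zero. */
    static inline uint64_t load_port_result(uint64_t loaded_data, bool permission_ok)
    {
        return permission_ok ? loaded_data : 0;
    }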

Related: I suggested the same mechanism for a HW fix for Meltdown in Why are AMD processors not/less vulnerable to Meltdown and Spectre?.

Sandasandakan answered 4/4, 2018 at 22:7 Comment(9)
@KrazyGlew: Any idea what kind of microarchitectural benefits you'd get from doing a load if there's any kind of TLB hit, even one with insufficient permissions? If early P6 before Conroe/Merom didn't do it, I guess it's less obvious or significant than I thought.Sandasandakan
Just so I am sure I understand this, you mean that the problem is that data is brought to the cache even if the TLB finds out that we are not allowed to access it after looking at the PTE?Padron
@abjoshi: In Meltdown the problem is that an already-hot line in L1d can report a hit and provide the actual data for speculative execution of later instructions, even though the TLB entry says only supervisor (kernel mode) code should be allowed to read this page. The speculative access that brings a new line into cache is to an array we are allowed to read. (And will later read non-speculatively for the cache-timing side channel to turn that microarchitectural state into architectural state - a number in a register in non-speculative code.)Sandasandakan
See blog.stuffedcow.net/2018/05/meltdown-microarchitecture for more details about how Meltdown works. I'm not sure if Meltdown works on cache misses on the "secret" data.Sandasandakan
Thanks for the link. So you mean that the line has to be already in the L1D cache? Also, I guess there are 2 scenarios on a cache-miss: 1. page is mapped in the TLB 2. page is not mapped, so a TLB miss. In either case, the TLB eventually finds the pte and assuming that we are not allowed to access it, informs the core which marks the instruction for exception. My confusion is whether the data is actually brought into the cache in either case at all, and if so, who sends the request to memory, the MMU or the cache controller?Padron
@abjoshi: if any data is going to be fetched, the request will come from L1d, not from the page-walk hardware. The page-walker will only fetch page directories and PTEs on a TLB miss. (There isn't a separate "MMU", the L1dTLB is very tightly coupled into the L1d cache.) As the "Results" table in blog.stuffedcow.net/2018/05/meltdown-microarchitecture shows, only data that was already in L1 ever leaks on Intel. quote: "When the data is not in the L1 cache (even when in the L2), the load returns zero and does not leak information".Sandasandakan
Ok. But the question of whether any data is fetched at all on a miss + permission violation still stands, i.e. will the L1D make a normal request even when data access is not permitted?Padron
@abjoshi: I'm not sure. IDK if the first attempt will populate the cache line on some uarches, and then a 2nd attempt at a Meltdown attack will let you actually read it. I wouldn't be surprised if Henry's blog mentions that somewhere, but I didn't scan it thoroughly.Sandasandakan
Let us continue this discussion in chat.Padron
